Sets collection

Sets mosaic

Gene Ontologies provide thousands of annotation terms. However, different genes often share the same annotations. This observation can be used to speed up the processing of large collections: the genes can be grouped into disjoint "tiles" so that:

all genes of the same tile share the same annotations
the tiles don't overlap and their mosaic covers all the genes
any annotation term could be represented as the (disjoint) union of the tiles

This "mosaic" representation of the sets collection is implemented by the SetMosaic type. Internally it uses SparseMaskMatrix to maintain efficient element-to-set, element-to-tile, etc. mappings. Since there are (much) fewer tiles than individual elements, operations like set union or intersection can be made much faster.
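The tiling idea can be illustrated with a minimal sketch in plain Julia. It does not use SetMosaic or SparseMaskMatrix; the toy gene sets and the helper names (`signature`, `set_from_tiles`) are hypothetical and only show how elements with identical set membership collapse into one tile.

```julia
# Hypothetical toy collection: set key => member elements (not the package API).
sets = Dict("ribosome"      => Set(["RPS3", "RPS6", "RPL5", "RPL11"]),
            "small_subunit" => Set(["RPS3", "RPS6"]),
            "translation"   => Set(["RPS3", "RPS6", "RPL5", "RPL11", "EIF4E"]))

setkeys  = sort!(collect(keys(sets)))
elements = sort!(collect(union(values(sets)...)))

# Membership signature of an element: which of the sets contain it.
signature(el) = [el in sets[k] for k in setkeys]

# Elements with identical signatures end up in the same tile.
tiles = Dict{Vector{Bool}, Vector{String}}()
for el in elements
    push!(get!(tiles, signature(el), String[]), el)
end

# Any set is the disjoint union of the tiles whose signature marks that set.
function set_from_tiles(k)
    i = findfirst(==(k), setkeys)
    parts = [els for (sig, els) in tiles if sig[i]]
    return sort!(reduce(vcat, parts; init = String[]))
end
```

In this toy example five genes collapse into three tiles, and each original set can be recovered from at most three tiles instead of being enumerated gene by gene.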

OptEnrichedSetCover.SetMosaic — Type

SetMosaic{T,S}

Represents a collection of (potentially overlapping) sets as a "mosaic" of non-overlapping "tiles".

Type parameters

T: type of set elements
S: type of set keys

OptEnrichedSetCover.nelements — Function

nelements(mosaic::SetMosaic) -> Int

The number of distinct elements (e.g. genes) in the sets collection.

OptEnrichedSetCover.nsets — Function

nsets(mosaic::SetMosaic) -> Int

The number of sets in the collection.

OptEnrichedSetCover.ntiles — Function

ntiles(mosaic::SetMosaic) -> Int

The number of tiles (pairwise disjoint sets) in the mosaic representation of the set collection.

OptEnrichedSetCover.set — Function

set(mosaic::SetMosaic, i::Integer) -> AbstractVector{Int}

Get the i-th set as the vector of indices of its elements.

OptEnrichedSetCover.setsize — Function

setsize(mosaic::SetMosaic, i::Integer) -> Int

Get the size of the i-th set.

OptEnrichedSetCover.tile — Function

tile(mosaic::SetMosaic, i::Integer) -> AbstractVector{Int}

Get the i-th mosaic tile as the vector of indices of its elements.

Set relevance

In typical experiments, even high-throughput ones, it is not possible to detect all annotated entities (genes or proteins), e.g. because of the detection bias of the experimental protocol or the sensitivity limit of the instrument. This must be taken into account when estimating whether a particular gene set is enriched among e.g. upregulated genes: only the detected genes should be considered for the enrichment scores.

It may happen that two distinct annotation terms share the same set of observed genes. In that case, their enrichment scores would be identical. If the enrichment is significant, both terms could be included in the report, but that would increase its redundancy. To solve this issue, OptEnrichedSetCover introduces the set relevance score. For example, if both the "ribosome" and "small ribosomal subunit" terms are significant, but the genes of the large ribosomal subunit are not detected in the data, it's natural to prefer the "small ribosomal subunit" term over the whole ribosome. The relevance score formalizes that by estimating the enrichment of detected entities within each annotation term. See the cover quality section for the exact definition of the relevance score.

OptEnrichedSetCover.set_relevance — Function

set_relevance(nset_observed::Integer, nset::Integer,
              nobserved::Integer, ntotal::Integer) -> Float64

Calculates the relevance weight of the set that contains nset elements, nset_observed of which were present (not necessarily enriched) in the data that identified nobserved elements out of all known (ntotal). It is used by SetMosaic to penalize sets that could not be observed in the data (e.g. biological processes or pathways that involve proteins not expressed by the cells used in the experiments).

For MaskedSetMosaic it's recommended to use the IDs of data entities (e.g. protein group IDs for proteomic data) to correctly count the set sizes and estimate enrichment. In contrast, set_relevance() should use the counts derived from the original IDs of the annotation database (e.g. UniProt accession codes). Otherwise it's not possible to correctly estimate the number of elements that belong to the given annotated set but were not observed in the data.

The returned value is the probability that no more than nset_observed elements were observed at random.

OptEnrichedSetCover.logpvalue — Function

logpvalue(nisect::Integer, na::Integer, nb::Integer, ntotal::Integer,
          [tail::Symbol = :right])

Log P-value for the two sets intersection.

A has na elements, B has nb elements, they have nisect elements in common, and there are ntotal elements in the "universe".

tail controls the null hypothesis:

:right (default): by chance A and B would have ≥ nisect elements in common
:left: by chance A and B would have ≤ nisect elements in common
:both: by chance A and B would have either ≤ nisect or ≥ nisect elements in common, whichever is less probable
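The three tails can be made concrete with a small sketch. It assumes the overlap size follows a hypergeometric distribution (the model behind Fisher's Exact Test) and reads :both as the smaller of the two one-sided tails. The function `sketch_logpvalue` and its helpers are hypothetical, written in Base Julia only, and are not the package's implementation of logpvalue.

```julia
# log of the binomial coefficient C(n, k), accumulated in log space
function logbinom(n::Integer, k::Integer)
    (0 <= k <= n) || return -Inf
    s = 0.0
    for i in 1:k
        s += log(n - k + i) - log(i)
    end
    return s
end

# log P(|A ∩ B| = k) when B is a random nb-subset of the ntotal-element universe
loghyper(k, na, nb, ntotal) =
    logbinom(na, k) + logbinom(ntotal - na, nb - k) - logbinom(ntotal, nb)

# numerically stable log(sum(exp.(xs))) for a non-empty vector
function logsumexp(xs)
    m = maximum(xs)
    isfinite(m) || return m
    return m + log(sum(exp(x - m) for x in xs))
end

# Hypothetical stand-in for logpvalue(); the tail semantics are an assumption.
function sketch_logpvalue(nisect, na, nb, ntotal, tail::Symbol = :right)
    lo, hi = max(0, na + nb - ntotal), min(na, nb)   # feasible overlap sizes
    logtail(ks) = isempty(ks) ? -Inf :
        logsumexp([loghyper(k, na, nb, ntotal) for k in ks])
    right = logtail(max(lo, nisect):hi)    # P(overlap ≥ nisect)
    left  = logtail(lo:min(hi, nisect))    # P(overlap ≤ nisect)
    tail == :right && return right
    tail == :left  && return left
    tail == :both  && return min(left, right)
    throw(ArgumentError("unsupported tail: $tail"))
end
```

For instance, `sketch_logpvalue(4, 4, 5, 10)` evaluates the chance that a 4-element set lies entirely within a random 5-element subset of a 10-element universe, while the `:left` tail for the same arguments covers every feasible overlap and is therefore close to zero on the log scale.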

Masked sets mosaic

OptEnrichedSetCover.AbstractWeightedSetMosaic — Type

SetMosaic with the weights assigned to its sets.

Type parameters

T: the type of elements
S: the type of set ids
E: the type of experiment ids
W: type of the weight

OptEnrichedSetCover.originalmosaic — Function

originalmosaic(mosaic::AbstractWeightedSetMosaic) -> SetMosaic

Get the original SetMosaic.

OptEnrichedSetCover.MaskedSetMosaic — Type

SetMosaic with the element masks (selections) on top. Sets that do not overlap with the masks are excluded (skipped) from MaskedSetMosaic. Optionally, the filtering can include testing for the minimal overlap significance P-value.

The tiles of non-overlapping sets are removed, and the tiles that have identical membership for all the masked sets are squashed into a single tile.
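The removal and squashing of tiles can be sketched as follows. This is a simplified illustration with a single mask; the tile layout (a named tuple of element indices and set ids) and all variable names are hypothetical and do not reflect the internal representation of MaskedSetMosaic.

```julia
# Hypothetical tiles: each lists its element indices and the ids of the sets containing it.
tiles = [(elements = [1, 2], sets = [1, 2]),
         (elements = [3],    sets = [1]),
         (elements = [4, 5], sets = [1, 3]),
         (elements = [6],    sets = [3])]
mask = Set([1, 3])   # elements selected by the (single) mask

# A set overlaps the mask if at least one of its tiles contains a masked element.
kept_sets = Set{Int}()
for t in tiles
    any(in(mask), t.elements) && union!(kept_sets, t.sets)
end

# Restrict each tile's membership to the kept sets, drop tiles that belong
# only to excluded sets, and merge tiles whose restricted membership coincides.
squashed = Dict{Vector{Int}, Vector{Int}}()
for t in tiles
    sig = sort!(filter(in(kept_sets), t.sets))
    isempty(sig) && continue
    append!(get!(squashed, sig, Int[]), t.elements)
end
```

Here set 3 does not overlap the mask, so the tile that belongs only to set 3 disappears, and the two tiles that differ only in their membership in set 3 are merged into one.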

Type parameters

T: the type of elements
S: the type of set ids
E: the type of experiment ids

OptEnrichedSetCover.mask — Function

mask(mosaic::SetMosaic, elmasks::AbstractMatrix{Bool};
     [experiment_ids::Union{AbstractVector, AbstractSet, Nothing} = nothing],
     [min_nmasked=1], [max_setsize=nothing],
     [max_overlap_logpvalue=0.0]) -> MaskedSetMosaic

Construct MaskedSetMosaic from the SetMosaic and the collection of element masks.

Arguments

min_nmasked: the minimal number of masked elements in a set to include in the mosaic
max_setsize (optional): ignore the annotation sets bigger than the specified size
max_overlap_logpvalue: the threshold of Fisher's Exact Test log P-value of the overlap between the set and the mask for the inclusion of the set into the mosaic. 0 accepts all sets.
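The effect of the size-based arguments can be sketched in a few lines. The example assumes a single Boolean mask vector rather than the matrix of masks that mask() accepts, and it ignores max_overlap_logpvalue; the helper `sketch_keep` and the toy data are hypothetical.

```julia
# Hypothetical sets given as vectors of element indices, plus one Boolean element mask.
sets = [[1, 2, 3], [3, 4], [5, 6, 7, 8, 9], [2]]
elmask = Bool[1, 1, 0, 0, 1, 1, 1, 0, 0]   # elmask[i] is true if element i is selected

# Keep a set if it is not too big and contains enough masked elements.
function sketch_keep(set, elmask; min_nmasked = 1, max_setsize = nothing)
    (max_setsize === nothing || length(set) <= max_setsize) || return false
    return count(i -> elmask[i], set) >= min_nmasked
end

kept_default = [i for (i, s) in enumerate(sets) if sketch_keep(s, elmask)]
kept_strict  = [i for (i, s) in enumerate(sets)
                if sketch_keep(s, elmask; min_nmasked = 2, max_setsize = 4)]
```

With the defaults only the set without any masked element is skipped; the stricter settings additionally drop the five-element set and the set with a single masked element.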

OptEnrichedSetCover.WeightedSetMosaic — Type

SetMosaic with the weights for the sets from multiple experiments on top.

Type parameters

T: the type of elements
S: the type of set ids
E: the type of experiment ids

OptEnrichedSetCover.assignweights — Function

assignweights(mosaic::SetMosaic, elmasks::AbstractMatrix{Bool};
             [experiment_ids::Union{AbstractVector, AbstractSet, Nothing} = nothing],
             [max_setsize=nothing],
             [max_weight], [max_min_weight]) -> WeightedSetMosaic

Construct WeightedSetMosaic from the SetMosaic and the external weights.

Arguments

max_setsize (optional): ignore the annotation sets bigger than the specified size
max_weight (optional): the maximal weight of the set to include in the mosaic
max_min_weight (optional): the maximal weight of the set in all experiments to include the set into the mosaic