Sets collection

Sets mosaic

Gene Ontologies provide thousands of annotation terms, yet many genes share exactly the same annotations. This redundancy can be used to speed up the processing of large set collections. The genes can be grouped into disjoint "tiles" so that:

  • all genes of the same tile share the same annotations
  • the tiles don't overlap and their mosaic covers all the genes
  • any annotation term could be represented as the (disjoint) union of the tiles

This "mosaic" representation of the set collection is implemented by the SetMosaic type. Internally it uses SparseMaskMatrix to maintain efficient mappings such as element-to-set and element-to-tile. Since there are (much) fewer tiles than individual elements, operations like set union or intersection can be made much faster.
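
The tiling idea can be illustrated with a minimal sketch that uses only the Julia standard library. It is not the package's SparseMaskMatrix-based implementation: the helper names (element_signatures, build_tiles, restore) and the toy gene sets are hypothetical, chosen only to show that elements with identical set membership fall into the same tile.

```julia
# Toy annotation collection: set key => member elements (made-up example data).
sets = Dict(
    "ribosome"      => ["RPS3", "RPS6", "RPL5", "RPL11"],
    "small_subunit" => ["RPS3", "RPS6"],
    "translation"   => ["RPS3", "RPS6", "RPL5", "RPL11", "EIF4E"],
)

# Map each element to its "signature": the sorted keys of all sets containing it.
function element_signatures(sets::Dict{String, Vector{String}})
    sigs = Dict{String, Vector{String}}()
    for (key, elems) in sets
        for el in elems
            push!(get!(sigs, el, String[]), key)
        end
    end
    foreach(sort!, values(sigs))
    return sigs
end

# Elements that share a signature form one tile (signature => tile elements).
function build_tiles(sets::Dict{String, Vector{String}})
    tiles = Dict{Vector{String}, Vector{String}}()
    for (el, sig) in element_signatures(sets)
        push!(get!(tiles, sig, String[]), el)
    end
    foreach(sort!, values(tiles))
    return tiles
end

# A set is recovered as the disjoint union of the tiles whose signature names it.
function restore(tiles, key)
    parts = [els for (sig, els) in tiles if key in sig]
    return sort!(reduce(vcat, parts; init = String[]))
end

tiles = build_tiles(sets)
```

In this toy collection the five genes collapse into three tiles, and restore(tiles, "ribosome") returns the same elements as the original "ribosome" set. A set operation then only needs to touch the tiles rather than every individual element.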

OptEnrichedSetCover.SetMosaic - Type
SetMosaic{T,S}

Represents a collection of (potentially overlapping) sets as a "mosaic" of non-overlapping "tiles".

Type parameters

  • T: type of set elements
  • S: type of set keys
OptEnrichedSetCover.ntiles - Function
ntiles(mosaic::SetMosaic) -> Int

The number of tiles (pairwise disjoint sets) in the mosaic representation of the set collection.

OptEnrichedSetCover.set - Function
set(mosaic::SetMosaic, i::Integer) -> AbstractVector{Int}

Get the i-th set as the vector of indices of its elements.

OptEnrichedSetCover.tile - Function
tile(mosaic::SetMosaic, i::Integer) -> AbstractVector{Int}

Get the i-th mosaic tile as the vector of indices of its elements.


Set relevance

In typical experiments, even high-throughput ones, it is not possible to detect all annotated entities (genes or proteins). There is, for example, detection bias due to the experimental protocol or the sensitivity limit of the instrument. This must be taken into account when estimating whether a particular gene set is enriched among, e.g., upregulated genes: only the detected genes should be considered for the enrichment scores.

It may happen that two distinct annotation terms share the same set of observed genes. In that case, their enrichment scores would be identical. If the enrichment is significant, both terms could be included in the report, but that would increase its redundancy. To solve this issue, OptEnrichedSetCover introduces the set relevance score. For example, if both the "ribosome" and "small ribosomal subunit" terms are significant, but the genes of the large ribosomal subunit are not detected in the data, it is natural to prefer the "small ribosomal subunit" term over the whole ribosome. The relevance score formalizes that by estimating the enrichment of detected entities within each annotation term. See the cover quality section for the exact definition of the relevance score.

OptEnrichedSetCover.set_relevance - Function
set_relevance(nset_observed::Integer, nset::Integer,
              nobserved::Integer, ntotal::Integer) -> Float64

Calculates the relevance weight of a set that contains nset elements, nset_observed of which were present (not necessarily enriched) in data that identified nobserved elements out of all known ones (ntotal). It is used by SetMosaic to penalize sets that could not be observed in the data (e.g. biological processes or pathways that involve proteins not expressed by the cells used in the experiments).

For MaskedSetMosaic it is recommended to use the IDs of data entities (e.g. protein group IDs for proteomic data), so that set sizes are counted and enrichment is estimated correctly. set_relevance(), in contrast, should use counts derived from the original IDs of the annotation database (e.g. UniProt accession codes). Otherwise it is not possible to correctly estimate the number of elements that belong to the given annotated set but were not observed in the data.

The returned value is the probability that no more than nset_observed elements were observed at random.
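
One way to read "the probability that no more than nset_observed elements were observed at random" is as the lower tail of a hypergeometric distribution: nobserved elements are drawn without replacement from ntotal, of which nset belong to the annotated set. The sketch below implements that reading with exact BigInt arithmetic. The hypergeometric model and the function name relevance_sketch are assumptions made for illustration; this is not the package's set_relevance code, and the exact definition is given in the cover quality section.

```julia
# Assumed model: X ~ Hypergeometric(nset successes, ntotal - nset failures,
# nobserved draws). Returns P(X <= nset_observed).
# `relevance_sketch` is a hypothetical name, not part of OptEnrichedSetCover.
function relevance_sketch(nset_observed::Integer, nset::Integer,
                          nobserved::Integer, ntotal::Integer)
    lo = max(0, nobserved - (ntotal - nset))    # smallest feasible overlap
    hi = min(nset_observed, nset, nobserved)    # largest overlap to count
    total = binomial(big(ntotal), nobserved)    # all ways to pick the observed elements
    acc = big(0)
    for i in lo:hi
        # i observed elements inside the set, nobserved - i outside of it
        acc += binomial(big(nset), i) * binomial(big(ntotal - nset), nobserved - i)
    end
    return Float64(acc // total)
end

# Toy numbers: 10 known elements, 5 observed, a set of 4 elements.
p_none = relevance_sketch(0, 4, 5, 10)   # no set element observed
p_all  = relevance_sketch(4, 4, 5, 10)   # every set element observed
```

Under this reading, a set whose members are mostly missing from the data gets a small value, while a fully observed set gets a value of 1.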


Masked sets mosaic

OptEnrichedSetCover.mask - Function
mask(mosaic::SetMosaic, elmasks;
     mask_ids::Union{AbstractVector, AbstractSet, Nothing} = nothing,
     [min_nmasked=1], [max_setsize=nothing],
     [max_overlap_logpvalue=0.0]) -> MaskedSetMosaic

Construct a MaskedSetMosaic from the SetMosaic and a collection of element masks.

Arguments

  • min_nmasked: the minimal number of masked elements in a set to include in the mosaic
  • max_setsize (optional): ignore the annotation sets bigger than the specified size
  • max_overlap_logpvalue: the threshold on the log P-value of Fisher's Exact Test for the overlap between the set and the mask; only sets whose log P-value does not exceed the threshold are included in the mosaic. Since log P-values are never positive, 0 accepts all sets.
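
The three filters above can be sketched as a single predicate. The sketch assumes a one-sided (enrichment) Fisher's Exact Test, computed as the upper tail of the hypergeometric distribution. The function names overlap_logpvalue and keep_set are hypothetical, and the package's own filtering logic may differ in detail.

```julia
# log P(X >= noverlap) for X ~ Hypergeometric: nmasked elements drawn from
# ntotal, of which nset belong to the set (one-sided Fisher's Exact Test).
# Hypothetical helper, not part of OptEnrichedSetCover.
function overlap_logpvalue(noverlap::Integer, nset::Integer,
                           nmasked::Integer, ntotal::Integer)
    lo = max(noverlap, 0, nmasked - (ntotal - nset))
    hi = min(nset, nmasked)
    total = binomial(big(ntotal), nmasked)
    acc = big(0)
    for i in lo:hi
        acc += binomial(big(nset), i) * binomial(big(ntotal - nset), nmasked - i)
    end
    return Float64(log(acc / total))    # BigInt / BigInt yields a BigFloat
end

# Hypothetical predicate that mirrors the keyword arguments of mask().
function keep_set(noverlap, nset, nmasked, ntotal;
                  min_nmasked = 1, max_setsize = nothing,
                  max_overlap_logpvalue = 0.0)
    noverlap >= min_nmasked || return false
    (max_setsize === nothing || nset <= max_setsize) || return false
    return overlap_logpvalue(noverlap, nset, nmasked, ntotal) <= max_overlap_logpvalue
end

# Toy numbers: a set of 4 elements, all of them among 5 masked out of 10.
logp = overlap_logpvalue(4, 4, 5, 10)
```

With the default threshold of 0 every set that passes the size filters is kept, because a log P-value cannot be positive. A stricter threshold such as -5.0 would drop the toy set above.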