Sets collection
Sets mosaic
Gene Ontologies provide thousands of annotation terms. However, different genes often share the same annotations. This observation can be used to speed up the processing of large collections. The genes can be grouped into disjoint "tiles" so that:
- all genes of the same tile share the same annotations
- the tiles don't overlap and their mosaic covers all the genes
- any annotation term can be represented as a (disjoint) union of tiles
This "mosaic" representation of the sets collection is implemented by the SetMosaic type. Internally it uses SparseMaskMatrix to maintain efficient element-to-set, element-to-tile, etc. mappings. Since there are (much) fewer tiles than individual elements, operations like set union or intersection can be made much faster.
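The tiling idea can be illustrated with a minimal, self-contained Julia sketch. It does not use SetMosaic or SparseMaskMatrix. The toy gene sets and the helper names build_tiles and set_from_tiles are hypothetical and serve only to show how elements with identical set membership collapse into one tile.

```julia
# Hypothetical toy collection: set key => elements (gene symbols).
sets = Dict("ribosome"      => Set(["RPS3", "RPS6", "RPL5"]),
            "small subunit" => Set(["RPS3", "RPS6"]),
            "translation"   => Set(["RPS3", "RPS6", "RPL5", "EIF4E"]))

# Group the elements by their membership signature, i.e. the sorted list of
# keys of the sets that contain the element. Elements that share a signature
# form one tile.
function build_tiles(sets::AbstractDict{S, <:AbstractSet{T}}) where {S, T}
    setkeys = sort!(collect(keys(sets)))
    tiles = Dict{Vector{S}, Vector{T}}()
    for el in sort!(collect(union(values(sets)...)))
        signature = filter(k -> el in sets[k], setkeys)
        push!(get!(tiles, signature, T[]), el)
    end
    return tiles
end

# Any set is the disjoint union of the tiles whose signature contains its key.
set_from_tiles(tiles, key) =
    sort(reduce(vcat, [els for (sig, els) in tiles if key in sig]))

tiles = build_tiles(sets)
for (sig, els) in tiles
    println(join(sig, " + "), " => ", els)
end
```

Here four genes collapse into three tiles, and each annotation term can be rebuilt from at most three tiles instead of being stored element by element.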
OptEnrichedSetCover.SetMosaic — Type
SetMosaic{T,S}
Represents a collection of (potentially overlapping) sets as a "mosaic" of non-overlapping "tiles".
Type parameters
- T: type of set elements
- S: type of set keys
OptEnrichedSetCover.nelements — Function
nelements(mosaic::SetMosaic) -> Int
The number of distinct elements (i.e. genes) in the sets collection.
OptEnrichedSetCover.nsets — Function
nsets(mosaic::SetMosaic) -> Int
The number of sets in the collection.
OptEnrichedSetCover.ntiles — Function
ntiles(mosaic::SetMosaic) -> Int
The number of tiles (pairwise disjoint sets) in the mosaic representation of the set collection.
OptEnrichedSetCover.set — Function
set(mosaic::SetMosaic, i::Integer) -> AbstractVector{Int}
Get the i-th set as the vector of indices of its elements.
OptEnrichedSetCover.setsize — Function
setsize(mosaic::SetMosaic, i::Integer) -> Int
Get the size of the i-th set.
OptEnrichedSetCover.tile — Function
tile(mosaic::SetMosaic, i::Integer) -> AbstractVector{Int}
Get the i-th mosaic tile as the vector of indices of its elements.
Set relevance
In real experiments, even high-throughput ones, it is not possible to detect all annotated entities (genes or proteins). There is, for example, detection bias due to the experimental protocol or the sensitivity limit of the instrument. This must be taken into account when estimating whether a particular gene set is enriched among, e.g., upregulated genes: only the detected genes should be considered for the enrichment scores.
It may happen that two distinct annotation terms share the same set of observed genes. In that case, their enrichment scores would be identical. If the enrichment is significant, both terms could be included in the report, but that would increase its redundancy. To solve this issue, OptEnrichedSetCover introduces the set relevance score. For example, if both the "ribosome" and "small ribosomal subunit" terms are significant, but the genes of the large ribosomal subunit are not detected in the data, it's natural to prefer the "small ribosomal subunit" term over the whole ribosome. The relevance score formalizes that by estimating the enrichment of detected entities within each annotation term. See the cover quality section for the exact definition of the relevance score.
OptEnrichedSetCover.set_relevance — Function
set_relevance(nset_observed::Integer, nset::Integer,
              nobserved::Integer, ntotal::Integer) -> Float64
Calculates the relevance weight of the set that contains nset elements, nset_observed of which were present (not necessarily enriched) in the data that identified nobserved elements out of all known (ntotal). It is used by SetMosaic to penalize the sets that could not be observed in the data (e.g. biological processes or pathways that involve proteins not expressed by the cells used in the experiments).
While for MaskedSetMosaic it's recommended to use the IDs of data entities (e.g. protein group IDs for proteomic data) to correctly count the set sizes and estimate enrichment, set_relevance() should use the counts derived from the original IDs of the annotation database (e.g. UniProt accession codes). Otherwise it's not possible to correctly estimate the number of elements that belong to the given annotated set, but were not observed in the data.
The returned value is the probability that no more than nset_observed elements were observed at random.
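As an illustration of the last sentence, the sketch below computes the probability of observing no more than nset_observed set members by chance. It assumes the observed elements are a random draw without replacement from all known ones (a hypergeometric model). This is an assumption made for the example, not a statement about the package internals. The helper names hyperpmf and relevance_sketch and all the counts are hypothetical.

```julia
# P(X = k): exactly k members of a set of size nset are among nobserved
# elements drawn without replacement from ntotal (assumed hypergeometric model).
# BigInt rationals keep the toy computation exact.
hyperpmf(k, nset, nobserved, ntotal) =
    (binomial(big(nset), k) * binomial(big(ntotal - nset), nobserved - k)) //
    binomial(big(ntotal), nobserved)

# P(X <= nset_observed), summed over the feasible values of k only.
function relevance_sketch(nset_observed, nset, nobserved, ntotal)
    kmin = max(0, nobserved - (ntotal - nset))
    kmax = min(nset_observed, nset, nobserved)
    return Float64(sum(hyperpmf(k, nset, nobserved, ntotal) for k in kmin:kmax))
end

# Hypothetical counts: 500 of 1000 known proteins were detected.
# All 30 small-subunit proteins were seen, but none of the other 50 ribosomal ones.
small_subunit = relevance_sketch(30, 30, 500, 1000)
ribosome      = relevance_sketch(30, 80, 500, 1000)
println("small subunit: ", small_subunit, ", ribosome: ", ribosome)
```

Under these assumptions the fully observed "small ribosomal subunit" term receives a higher score than the partially observed "ribosome" term, which matches the preference described above.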
OptEnrichedSetCover.logpvalue — Function
logpvalue(nisect::Integer, na::Integer, nb::Integer, ntotal::Integer,
          [tail::Symbol = :right])
Log P-value for the two sets intersection.
A has na elements, B has nb elements, they have nisect elements in common, and there are ntotal elements in the "universe".
tail controls the null hypothesis:
- :right (default): by chance A and B would have ≥ nisect elements in common
- :left: by chance A and B would have ≤ nisect elements in common
- :both: by chance A and B would have either ≤ or ≥ nisect elements in common, whichever is less probable
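The three tails can be sketched with a plain hypergeometric model in log space. This is a minimal illustration that follows the description above literally, not the package implementation. The names logsumexp and logpvalue_sketch are hypothetical. Taking the smaller of the two one-sided values for :both, with no two-sided correction, is an assumption.

```julia
# Numerically stable log(sum(exp.(xs))); an empty sum has log-probability -Inf.
function logsumexp(xs)
    isempty(xs) && return -Inf
    m = maximum(xs)
    return m + log(sum(exp(x - m) for x in xs))
end

function logpvalue_sketch(nisect::Integer, na::Integer, nb::Integer,
                          ntotal::Integer, tail::Symbol = :right)
    lf = [0.0; cumsum(log.(1:ntotal))]            # lf[n + 1] == log(n!)
    logbinom(n, k) = lf[n + 1] - lf[k + 1] - lf[n - k + 1]
    # log-probability that a random B of size nb shares exactly k elements with A
    logpmf(k) = logbinom(na, k) + logbinom(ntotal - na, nb - k) - logbinom(ntotal, nb)
    kmin, kmax = max(0, na + nb - ntotal), min(na, nb)   # feasible overlaps
    right = logsumexp([logpmf(k) for k in max(nisect, kmin):kmax])
    left = logsumexp([logpmf(k) for k in kmin:min(nisect, kmax)])
    tail == :right && return right
    tail == :left && return left
    tail == :both && return min(left, right)      # assumption: no correction factor
    throw(ArgumentError("unsupported tail: $tail"))
end

# Two sets of 2 elements out of a universe of 4 that overlap completely.
println(exp(logpvalue_sketch(2, 2, 2, 4)))         # right tail
println(exp(logpvalue_sketch(2, 2, 2, 4, :left)))  # left tail
```

Working with logarithms avoids underflow for large universes, where the probabilities of individual overlaps become too small for Float64.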
Masked sets mosaic
OptEnrichedSetCover.AbstractWeightedSetMosaic — Type
SetMosaic with the weights assigned to its sets.
Type parameters
- T: the type of elements
- S: the type of set ids
- E: the type of experiment ids
- W: the type of the weight
OptEnrichedSetCover.originalmosaic — Function
originalmosaic(mosaic::AbstractWeightedSetMosaic) -> SetMosaic
Get the original SetMosaic.
OptEnrichedSetCover.MaskedSetMosaic — Type
SetMosaic with the element masks (selections) on top. Sets that do not overlap with the masks are excluded (skipped) from MaskedSetMosaic. Optionally, the filtering can include a test of the overlap significance (P-value threshold).
The tiles of non-overlapped sets are removed, and the tiles that have identical membership for all the masked sets are squashed into a single tile.
Type parameters
- T: the type of elements
- S: the type of set ids
- E: the type of experiment ids
OptEnrichedSetCover.mask — Function
mask(mosaic::SetMosaic, elmasks::AbstractMatrix{Bool};
     [experiment_ids::Union{AbstractVector, AbstractSet, Nothing} = nothing],
     [min_nmasked=1], [max_setsize=nothing],
     [max_overlap_logpvalue=0.0]) -> MaskedSetMosaic
Construct MaskedSetMosaic from the SetMosaic and the collection of element masks.
Arguments
- min_nmasked: the minimal number of masked elements in a set to include in the mosaic
- max_setsize (optional): ignore the annotation sets bigger than the specified size
- max_overlap_logpvalue: the threshold of Fisher's Exact Test log P-value of the overlap between the set and the mask for the inclusion of the set into the mosaic. 0 accepts all sets.
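The effect of masking on the tiles can be illustrated with a simplified, self-contained sketch for a single mask. It does not call mask() or construct a MaskedSetMosaic. The function masked_tiles and the toy data are hypothetical, and the P-value filter (max_overlap_logpvalue) is left out for brevity.

```julia
# Hypothetical toy data: sets as vectors of element indices, one Bool mask.
sets = Dict("A" => [1, 2, 3], "B" => [3, 4], "C" => [5, 6])
elmask = Bool[1, 1, 0, 0, 0, 0]     # elements 1 and 2 are selected

function masked_tiles(sets, elmask; min_nmasked = 1, max_setsize = nothing)
    # keep the sets that are small enough and have enough masked elements
    kept = sort!([k for (k, els) in sets
                  if (max_setsize === nothing || length(els) <= max_setsize) &&
                     count(i -> elmask[i], els) >= min_nmasked])
    # regroup the elements by their membership in the kept sets only:
    # tiles that differed only in the dropped sets collapse into one
    tiles = Dict{Vector{eltype(kept)}, Vector{Int}}()
    for el in eachindex(elmask)
        signature = filter(k -> el in sets[k], kept)
        isempty(signature) && continue      # element of dropped sets only
        push!(get!(tiles, signature, Int[]), el)
    end
    return kept, tiles
end

kept, tiles = masked_tiles(sets, elmask)
println(kept, " ", tiles)
```

Before masking, elements 1 and 2 (only in "A") and element 3 (in "A" and "B") belong to different tiles. Once "B" is dropped for having no masked elements, they share the same membership and are squashed into a single tile.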
OptEnrichedSetCover.WeightedSetMosaic — Type
SetMosaic with the weights for the sets from multiple experiments on top.
Type parameters
- T: the type of elements
- S: the type of set ids
- E: the type of experiment ids
OptEnrichedSetCover.assignweights — Function
assignweights(mosaic::SetMosaic, elmasks::AbstractMatrix{Bool};
              [experiment_ids::Union{AbstractVector, AbstractSet, Nothing} = nothing],
              [max_setsize=nothing],
              [max_weight], [max_min_weight]) -> WeightedSetMosaic
Construct WeightedSetMosaic from the SetMosaic and the external weights.
Arguments
- max_setsize (optional): ignore the annotation sets bigger than the specified size
- max_weight (optional): the maximal weight of the set to include in the mosaic
- max_min_weight (optional): the maximal weight of the set in all experiments to include the set into the mosaic