Source-Sink Analysis
For any edge weight threshold t, cutting the strongly connected components (SCC) tree provides the set of SCC of the original directed weighted network. Within each SCC subnetwork, there is a path from any node to any other node along the edges with weights not smaller than t. The original network may still contain the other edges with weights ≥t, which can connect one SCC subnetwork to the other, but it is not possible to reenter the same SCC again by traveling along these edges (otherwise by definition there is a bigger SCC). In other words, for each t we have a direct acyclic graph of connections between SCCs.
This property makes it very convenient to enumerate all paths from one set of nodes (sources) to the other (sinks). It is implemented by the flowgraph()
method. One can use it to identify signaling subnetworks that connect one biological data (e.g. interactors of a particular protein) to another (e.g. downstream changes resulting from knock out or overexpressing this protein).
The output of flowgraph()
is the subnetwork that consists of selected SCCs and the edges that connect these SCCs, plus the list of paths within this subnetwork from source to sink nodes. It could be shown that the path lengths tend to be smaller for the diffusion networks based on the real data than the ones based on the reshuffled node weights.
Network Metrics
It is important to have a way to check the relevance of HierarchicalHotNet predictions. For example, one can randomize the input data and show that, at some edge weight threshold $t_*$, HHotNet predictions based on the real data, $H(\mathcal{D}_{\mathrm{real}}, t_*)$, demnostrate significantly more "order" than the ones based on randomized data, $H(\mathcal{D}_{\mathrm{perm}}, t_*)$. If the "order" could be expressed as some metric $m(H)$, then we can define the $p$-value for the hypothesis that $H(\mathcal{D}_{\mathrm{real}}, t)$ is significantly more "ordered" than expected by chance:
\[p_m(H(\mathcal{D}_{\mathrm{real}}, t)) = P\big(m(H(\mathcal{D}_{\mathrm{perm}}, t_*)) \geq m(H(\mathcal{D}_{\mathrm{real}}, t)) \big).\]
$p_m$ could be approximated by generating a family HHotNet prediction based on randomized data, $H(\mathcal{D}_{\mathrm{perm}}^i, t))$, $i = 1, 2, \ldots, N_{\mathrm{perm}}$. The definition above was given for the metric $m(H)$ that increases as $H$ becomes more "ordered". If the metric $m(H)$ decreases as the "order" of $H$ grows, $P(M \geq m(H))$ should be changed to $P(M \leq m(H))$.
In M.A. Reyna et al (2018), the $m_{\max\mathrm{size}}(H)$ metric – the size of the maximal strongly connected component – was used, and it was shown that $m_{\max\mathrm{size}}(H(\mathcal{D}_{\mathrm{real}}, t))$ is statistically significantly larger than at random for some $t_*$. treecut_stats()
provides $m_{\max\mathrm{size}}(t)$ in maxcomponent_size
column. It is also possible to use the total size of top N components (topn_components_sizesum
column) or the number of non-trivial (larger than single node) SCCs (ncomponents_nontrivial
column) for the same purpose.
Flow Metrics
Metrics like $m_{\max\mathrm{size}}(H)$ could be used for the analysis of significantly perturbed subnetworks, but they don't allow estimating how strong is the relationship between sources and sinks. The source-to-sink flows analysis requires different metrics. The one that shows good results is the average inverse of flow length:
\[L_{\mathrm{avg}}^{-1}(H) = \frac{1}{N_{\mathrm{source}} \cdot N_{\mathrm{sink}}} \sum_{f \in \mathrm{flows}(H)} \frac{1}{N_{\mathrm{SCC}}(f)},\]
where $N_{\mathrm{source}}$ and $N_{\mathrm{sink}}$ are the numbers of sources and sinks, respectively (so $N_{\mathrm{source}} \cdot N_{\mathrm{sink}}$ is the number of all possible distinct flows), $\mathrm{flows}(H)$ are all the source-to-sink flows of $H$, and $N_{\mathrm{SCC}}(f)$ is the number of distinct strongly connected components that contain the nodes of the $f$ flow. This metric changes from 1 (all sources and sinks in the same strongly connected component) to 0 (no or infinitely long flows). This metric is available as flow_avginvlen
column in the treecut_stats()
output.
The alternative metrics for the flow analysis provided by treecut_stats()
are:
- the average transition probability of the flow (
flow_avgweight
column):
\[w_{\mathrm{avg. flow}}(H) = \frac{1}{N_{\mathrm{src}} \cdot N_{\mathrm{sink}}} \sum_{f \in \mathrm{flows}(H)} \min \big( w(\mathrm{source}(f), \mathrm{sink}(f)), w_{\max} \big),\]
where $w(i, j)$ is the transition probability from $i$-th to $j$-th node in the random walk with restart. The distribution of $w(i, j)$ is essentially non-gaussian, and to limit the strong influence of a few "outliers" on the total $w_{\mathrm{avg. flow}}(H)$, some $w_{\max}$ constant is used.
- the average flow transition probability per SCC (
flow_avghopweight
column):
\[w_{\mathrm{avg. hop}}(H) = \frac{1}{N_{\mathrm{src}} \cdot N_{\mathrm{sink}}} \sum_{f \in \mathrm{flows}(H)} \frac{\min \big( w(\mathrm{source}(f), \mathrm{sink}(f)), w_{\max} \big)}{N_{\mathrm{SCC}}(f)},\]
HierarchicalHotNet.flowgraph
— Functionflowgraph(tree::SCCTree, adjmtx::AbstractMatrix,
sources::AbstractVector{Int}, sinks::AbstractVector{Int},
test::EdgeTest;
kwargs...) -> Tuple{AbstractVector{CompDiedge}, AbstractVector{CompFlow}, IndicesPartition}
Constructs the flow graph from sources
to sinks
vertices using the strongly connected components tree
and the graph defined by adjmtx
. The adjmtx
edge should pass the test
to be included in the resulting graph.
HierarchicalHotNet.flowgraph!
— Functionflowgraph!(subgraph::Union{AbstractVector{CompDiedge}, Nothing},
flows::AbstractVector{CompFlow},
comps::IndicesPartition,
tree::SCCTree, adjmtx::AbstractMatrix,
sources::AbstractVector{Int}, sinks::AbstractVector{Int},
test::EdgeTest;
kwargs...) -> Tuple{AbstractVector{CompDiedge}, AbstractVector{CompFlow}, IndicesPartition}
In-place version of HierarchicalHotNet.flowgraph
.
HierarchicalHotNet.flowstats
— Functionflowstats(comps::IndicesPartition, adjmtx::AbstractMatrix,
sources::AbstractVector{Int}, sinks::AbstractVector{Int},
test::EdgeTest;
maxweight::Union{Number, Nothing} = nothing,
used_sources::Union{AbstractVector{Int}, Nothing}=nothing,
used_sinks::Union{AbstractVector{Int}, Nothing}=nothing) -> NamedTupe
Calculates statistics from the flows from sources to sinks vertices in the weighted directed graph defined by adjmtx and test and the comps vertex components.
Returns the NamedTuple with the following fields
nflows
: the number of source → sink flowsncompoflows
: the number of unique component(source) → component(sink) pairs for each source → sink flowflowlen_sum
: total length of all flows (flow length = the number of SCCs it crosses)compflowlen_sum
: total length of flows (every unique pairs of source/sink components is counted once)flowinvlen_sum
: the sum of $1/\mathrm{flowlength}$compflowinvlen_sum
: the sum of $1/\mathrm{flowlength}$ (each unique source/sink component pair is counted once)compflowlen_max
:floweight_sum
:floweight_sum
:compfloweight_sum
:flowavghopweight_sum
:ncompsources
: the number of distinct SCCs that have source nodesncompsinks
: the number of distinct SCCs that have sink nodes
Keyword arguments
maxweight
: if specified, limits the flow weight bymaxweight
used_sources::AbstractVector{Int}
: if specified, acts as an output parameter that contains the sorted list of sources that have outgoing flowsused_sinks::AbstractVector{Int}
: if specified, acts as an output parameter that contains the sorted list of sinks that have incoming flows
HierarchicalHotNet.traceflows
— Functiontraceflows(step_adjmtx::AbstractMatrix, steptest::EdgeTest,
walk_adjmtx::AbstractMatrix, walktest::EdgeTest;
kwargs...) -> Dict{Diedge, Partition{Int}}
Trace the random walk (specified by walk_adjmtx
and walktest
) steps in the original graph (given by step_adjmtx
and steptest
).
See HierarchicalHotNet.traceflows!
for the detailed description.
HierarchicalHotNet.traceflows!
— Functiontraceflows!(flow2paths::AbstractDict{Diedge, Partition{Int}},
step_adjmtx::AbstractMatrix,
steptest::EdgeTest,
walk_adjmtx::AbstractMatrix,
walktest::EdgeTest;
sources::Union{AbstractVector, AbstractSet, Nothing} = nothing,
sinks::Union{AbstractVector, AbstractSet, Nothing} = nothing,
maxsteps::Integer=2) -> Dict{Diedge, Partition{Int}}
Trace the random walk (specified by walk_adjmtx and walktest) steps in the original graph (given by step_adjmtx and steptest).
Keyword arguments
sources
(optional): the indices of vertices to use as path starts. If not specified, all vertices are used as path starts.sinks
(optional): the indices of vertices to use as path ends. If not specified, all vertices are used as path ends.maxsteps=2
: maximal number of steps in a traced path, longer paths are discarded.
Returns the mapping from the flow diedges to the HierarchicalHotNet.Partition
object. Each part corresponds to the path, from diedge start to diedge end, in the original network, and the part elements are the indices of the intermedidate vertices along the path (start and end not included).