Source-Sink Analysis

For any edge weight threshold t, cutting the strongly connected components (SCC) tree provides the set of SCC of the original directed weighted network. Within each SCC subnetwork, there is a path from any node to any other node along the edges with weights not smaller than t. The original network may still contain the other edges with weights ≥t, which can connect one SCC subnetwork to the other, but it is not possible to reenter the same SCC again by traveling along these edges (otherwise by definition there is a bigger SCC). In other words, for each t we have a direct acyclic graph of connections between SCCs.

This property makes it very convenient to enumerate all paths from one set of nodes (sources) to the other (sinks). It is implemented by the flowgraph() method. One can use it to identify signaling subnetworks that connect one biological data (e.g. interactors of a particular protein) to another (e.g. downstream changes resulting from knock out or overexpressing this protein).

The output of flowgraph() is the subnetwork that consists of selected SCCs and the edges that connect these SCCs, plus the list of paths within this subnetwork from source to sink nodes. It could be shown that the path lengths tend to be smaller for the diffusion networks based on the real data than the ones based on the reshuffled node weights.

Network Metrics

It is important to have a way to check the relevance of HierarchicalHotNet predictions. For example, one can randomize the input data and show that, at some edge weight threshold $t_*$, HHotNet predictions based on the real data, $H(\mathcal{D}_{\mathrm{real}}, t_*)$, demnostrate significantly more "order" than the ones based on randomized data, $H(\mathcal{D}_{\mathrm{perm}}, t_*)$. If the "order" could be expressed as some metric $m(H)$, then we can define the $p$-value for the hypothesis that $H(\mathcal{D}_{\mathrm{real}}, t)$ is significantly more "ordered" than expected by chance:

\[p_m(H(\mathcal{D}_{\mathrm{real}}, t)) = P\big(m(H(\mathcal{D}_{\mathrm{perm}}, t_*)) \geq m(H(\mathcal{D}_{\mathrm{real}}, t)) \big).\]

$p_m$ could be approximated by generating a family HHotNet prediction based on randomized data, $H(\mathcal{D}_{\mathrm{perm}}^i, t))$, $i = 1, 2, \ldots, N_{\mathrm{perm}}$. The definition above was given for the metric $m(H)$ that increases as $H$ becomes more "ordered". If the metric $m(H)$ decreases as the "order" of $H$ grows, $P(M \geq m(H))$ should be changed to $P(M \leq m(H))$.

In M.A. Reyna et al (2018), the $m_{\max\mathrm{size}}(H)$ metric – the size of the maximal strongly connected component – was used, and it was shown that $m_{\max\mathrm{size}}(H(\mathcal{D}_{\mathrm{real}}, t))$ is statistically significantly larger than at random for some $t_*$. treecut_stats() provides $m_{\max\mathrm{size}}(t)$ in maxcomponent_size column. It is also possible to use the total size of top N components (topn_components_sizesum column) or the number of non-trivial (larger than single node) SCCs (ncomponents_nontrivial column) for the same purpose.

Flow Metrics

Metrics like $m_{\max\mathrm{size}}(H)$ could be used for the analysis of significantly perturbed subnetworks, but they don't allow estimating how strong is the relationship between sources and sinks. The source-to-sink flows analysis requires different metrics. The one that shows good results is the average inverse of flow length:

\[L_{\mathrm{avg}}^{-1}(H) = \frac{1}{N_{\mathrm{source}} \cdot N_{\mathrm{sink}}} \sum_{f \in \mathrm{flows}(H)} \frac{1}{N_{\mathrm{SCC}}(f)},\]

where $N_{\mathrm{source}}$ and $N_{\mathrm{sink}}$ are the numbers of sources and sinks, respectively (so $N_{\mathrm{source}} \cdot N_{\mathrm{sink}}$ is the number of all possible distinct flows), $\mathrm{flows}(H)$ are all the source-to-sink flows of $H$, and $N_{\mathrm{SCC}}(f)$ is the number of distinct strongly connected components that contain the nodes of the $f$ flow. This metric changes from 1 (all sources and sinks in the same strongly connected component) to 0 (no or infinitely long flows). This metric is available as flow_avginvlen column in the treecut_stats() output.

The alternative metrics for the flow analysis provided by treecut_stats() are:

the average transition probability of the flow (flow_avgweight column):

\[w_{\mathrm{avg. flow}}(H) = \frac{1}{N_{\mathrm{src}} \cdot N_{\mathrm{sink}}} \sum_{f \in \mathrm{flows}(H)} \min \big( w(\mathrm{source}(f), \mathrm{sink}(f)), w_{\max} \big),\]

where $w(i, j)$ is the transition probability from $i$-th to $j$-th node in the random walk with restart. The distribution of $w(i, j)$ is essentially non-gaussian, and to limit the strong influence of a few "outliers" on the total $w_{\mathrm{avg. flow}}(H)$, some $w_{\max}$ constant is used.

the average flow transition probability per SCC (flow_avghopweight column):

\[w_{\mathrm{avg. hop}}(H) = \frac{1}{N_{\mathrm{src}} \cdot N_{\mathrm{sink}}} \sum_{f \in \mathrm{flows}(H)} \frac{\min \big( w(\mathrm{source}(f), \mathrm{sink}(f)), w_{\max} \big)}{N_{\mathrm{SCC}}(f)},\]

HierarchicalHotNet.flowgraph — Function

flowgraph(tree::SCCTree, adjmtx::AbstractMatrix,
          sources::AbstractVector{Int}, sinks::AbstractVector{Int},
          test::EdgeTest;
          kwargs...) -> Tuple{AbstractVector{CompDiedge}, AbstractVector{CompFlow}, IndicesPartition}

Constructs the flow graph from sources to sinks vertices using the strongly connected components tree and the graph defined by adjmtx. The adjmtx edge should pass the test to be included in the resulting graph.

source

HierarchicalHotNet.flowgraph! — Function

flowgraph!(subgraph::Union{AbstractVector{CompDiedge}, Nothing},
           flows::AbstractVector{CompFlow},
           comps::IndicesPartition,
           tree::SCCTree, adjmtx::AbstractMatrix,
           sources::AbstractVector{Int}, sinks::AbstractVector{Int},
           test::EdgeTest;
           kwargs...) -> Tuple{AbstractVector{CompDiedge}, AbstractVector{CompFlow}, IndicesPartition}

In-place version of HierarchicalHotNet.flowgraph.

source

HierarchicalHotNet.flowstats — Function

flowstats(comps::IndicesPartition, adjmtx::AbstractMatrix,
          sources::AbstractVector{Int}, sinks::AbstractVector{Int},
          test::EdgeTest;
          maxweight::Union{Number, Nothing} = nothing,
          used_sources::Union{AbstractVector{Int}, Nothing}=nothing,
          used_sinks::Union{AbstractVector{Int}, Nothing}=nothing) -> NamedTupe

Calculates statistics from the flows from sources to sinks vertices in the weighted directed graph defined by adjmtx and test and the comps vertex components.

Returns the NamedTuple with the following fields

nflows: the number of source → sink flows
ncompoflows: the number of unique component(source) → component(sink) pairs for each source → sink flow
flowlen_sum: total length of all flows (flow length = the number of SCCs it crosses)
compflowlen_sum: total length of flows (every unique pairs of source/sink components is counted once)
flowinvlen_sum: the sum of $1/\mathrm{flowlength}$
compflowinvlen_sum: the sum of $1/\mathrm{flowlength}$ (each unique source/sink component pair is counted once)
compflowlen_max:
floweight_sum:
floweight_sum:
compfloweight_sum:
flowavghopweight_sum:
ncompsources: the number of distinct SCCs that have source nodes
ncompsinks: the number of distinct SCCs that have sink nodes

Keyword arguments

maxweight: if specified, limits the flow weight by maxweight
used_sources::AbstractVector{Int}: if specified, acts as an output parameter that contains the sorted list of sources that have outgoing flows
used_sinks::AbstractVector{Int}: if specified, acts as an output parameter that contains the sorted list of sinks that have incoming flows

source

HierarchicalHotNet.traceflows — Function

traceflows(step_adjmtx::AbstractMatrix, steptest::EdgeTest,
           walk_adjmtx::AbstractMatrix, walktest::EdgeTest;
           kwargs...) -> Dict{Diedge, Partition{Int}}

Trace the random walk (specified by walk_adjmtx and walktest) steps in the original graph (given by step_adjmtx and steptest).

See HierarchicalHotNet.traceflows! for the detailed description.

source

HierarchicalHotNet.traceflows! — Function

traceflows!(flow2paths::AbstractDict{Diedge, Partition{Int}},
            step_adjmtx::AbstractMatrix,
            steptest::EdgeTest,
            walk_adjmtx::AbstractMatrix,
            walktest::EdgeTest;
            sources::Union{AbstractVector, AbstractSet, Nothing} = nothing,
            sinks::Union{AbstractVector, AbstractSet, Nothing} = nothing,
            maxsteps::Integer=2) -> Dict{Diedge, Partition{Int}}

Trace the random walk (specified by walk_adjmtx and walktest) steps in the original graph (given by step_adjmtx and steptest).

Keyword arguments

sources (optional): the indices of vertices to use as path starts. If not specified, all vertices are used as path starts.
sinks (optional): the indices of vertices to use as path ends. If not specified, all vertices are used as path ends.
maxsteps=2: maximal number of steps in a traced path, longer paths are discarded.

Returns the mapping from the flow diedges to the HierarchicalHotNet.Partition object. Each part corresponds to the path, from diedge start to diedge end, in the original network, and the part elements are the indices of the intermedidate vertices along the path (start and end not included).

source