4. Adjacency

Within connected component, subcomponents can be further compartmentalised. The number of subcomponents with its central node (the highest count) one edge away from the rest of nodes is treated as the deduplicated molecule count, leading to the Adjacency method in UMI-tools¹.

Image title — **Fig** 1. A 7-node UMI graph

The Adjacency method draws on the information about the count of UMIs. In the example graph, the counts of the 7 unique UMIs are given in node_val_sorted.

node_val_sorted = pd.Series({
    'A': 120,
    'D': 90,
    'E': 50,
    'G': 5,
    'B': 2,
    'C': 2,
    'F': 1,
})

It also requires the connected components as input because those are what the Adjacency method classifies further into connected subcomponents.

Python

import umiche as uc

ccs = uc.dedup.cluster(
    graph=graph_adj,
    method='deque',
)

dedup_res = uc.dedup.adjacency(
    connected_components=ccs,
    df_umi_uniq_val_cnt=node_val_sorted,
    graph_adj=graph_adj
)
dedup_count = dedup_res['count']
dedup_clusters = dedup_res['clusters']
print("deduplicated count:\n{}".format(dedup_count))
print("deduplicated clusters:\n{}".format(dedup_clusters))

console

30/07/2024 15:44:54 logger: ======>connected_components: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
30/07/2024 15:44:54 logger: =========>subcomponents: {'A': ['B']}
30/07/2024 15:44:54 logger: =========>subcomponents: {'A': ['B', 'C']}
30/07/2024 15:44:54 logger: =========>subcomponents: {'A': ['B', 'C', 'D']}
30/07/2024 15:44:54 logger: ======>The ccurent ele popping out: A {'C', 'B', 'A', 'D'}
30/07/2024 15:44:54 logger: =========>subcomponents: {'A': ['B', 'C', 'D'], 'D': ['E']}
30/07/2024 15:44:54 logger: =========>subcomponents: {'A': ['B', 'C', 'D'], 'D': ['E', 'F']}
30/07/2024 15:44:54 logger: ======>The ccurent ele popping out: D {'F', 'E', 'B', 'C', 'A', 'D'}
30/07/2024 15:44:54 logger: =========>subcomponents: {'A': ['B', 'C', 'D'], 'D': ['A', 'E', 'F'], 'E': ['G']}
30/07/2024 15:44:54 logger: ======>The ccurent ele popping out: E {'G', 'F', 'E', 'B', 'C', 'A', 'D'}
deduplicated count:
3
deduplicated clusters:
{'cc_0': {'node_A': ['A', 'B', 'C'], 'node_D': ['D', 'F'], 'node_E': ['E', 'G']}}
deduplicated clusters decomposed:
{0: ['A', 'B', 'C'], 1: ['D', 'F'], 2: ['E', 'G']}

Deduplicated UMI count

There are 3 connected subcomponents (node_A, node_D, and node_E) in connected component cc_0 and therefore, the deduplicated count of UMIs from 7 unique UMIs is 3 at the single locus.

We implemented the method based on the example 6-node UMI graph from the UMI-tools paper.

Abstract

The method aims to accurately resolve complex networks by analyzing node counts. It begins by removing the most abundant node and all nodes connected to it. If not all nodes in the network are accounted for, the next most abundant node and its neighbors are removed. This process is repeated until all nodes in the network are addressed. The total number of steps required to resolve the networks at a given locus corresponds to the estimated number of unique molecules.

Python

def adjacency(
        connected_components,
        df_umi_uniq_val_cnt,
        graph_adj,
):
    tcl = []
    cc_subs = {}
    for i, cc in connected_components.items():
        # print('cc: ', cc)
        step, cc_sub = adjacency_search(df_umi_uniq_val_cnt=df_umi_uniq_val_cnt, cc=cc, graph_adj=graph_adj)
        tcl.append(step)
        cc_subs['cc_' + str(i)] = cc_sub
        # print(self.umi_tools(cc_sorted))
    return {
        'count': sum(tcl),
        'clusters': cc_subs,
    }

def adjacency_search(
        df_umi_uniq_val_cnt,
        cc,
        graph_adj,
):
    cc_umi_sorted = df_umi_uniq_val_cnt.loc[df_umi_uniq_val_cnt.index.isin(cc)].sort_values(ascending=False).to_dict()
    ### @@ cc_umi_sorted
    # {'A': 456, 'E': 90, 'D': 72, 'B': 2, 'C': 2, 'F': 1}
    cc_sorted = [*cc_umi_sorted.keys()]
    ### @@ cc_sorted
    # ['A', 'E', 'D', 'B', 'C', 'F']
    visited = set()
    step = 1
    subcomponents = {}
    cc_set = set(cc_sorted)
    while cc_sorted:
        e = cc_sorted.pop(0)
        subcomponents[e] = []
        for node in graph_adj[e]:
            if node not in visited:
                subcomponents[e].append(node)
                # print(subcomponents)
        ### @@ e, subcomponents[e]
        # A ['B', 'C', 'D']
        # E []
        # D ['F']
        visited.add(e)
        ### @@ e, visited
        # A {'A'}
        # E {'B', 'A', 'E', 'C', 'D'}
        # D {'B', 'A', 'E', 'C', 'D'}
        visited.update(graph_adj[e])
        ### @@ e, visited
        # A {'B', 'D', 'A', 'C'}
        # E {'B', 'A', 'E', 'C', 'D'}
        # D {'B', 'F', 'A', 'E', 'C', 'D'}
        subcomponents[e] = graph_adj[e]
        ### @@ e, subcomponents[e]
        # A ['B', 'C', 'D']
        # E ['D']
        # D ['A', 'E', 'F']
        # print('the ccurent ele popping out: {} {}'.format(e, visited))
        if visited == cc_set:
            # print(step)
            break
        else:
            step += 1
    ### @@ subcomponents
    # {'A': ['B', 'C', 'D'], 'E': ['D'], 'D': ['A', 'E', 'F']}
    vertex = [*subcomponents.keys()]
    ### @@ vertex
    # ['A', 'E', 'D']
    cc_sub = {}
    for k, v in subcomponents.items():
        cc_sub['node_' + str(k)] = [k]
        for i in v:
            if i not in vertex:
                cc_sub['node_' + str(k)].append(i)
    return step, cc_sub

Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017 Mar;27(3):491-499. doi: 10.1101/gr.209601.116. Epub 2017 Jan 18. PMID: 28100584; PMCID: PMC5340976. ↩