6. Markov clustering

We developed mclUMI for automatic detection of UMI clusters by the Markov clustering algorithm¹ without the need for calculating UMI counts, leading to the MCL method. It has two derivatives (MCL-ed and MCL-val) by considering the information about UMI counts.

`MCL`¶

The MCL method collapses UMIs without the information about the count of UMIs.

Python

import umiche as uc

ccs = uc.dedup.cluster(
    graph=graph_adj,
    method='deque',
)

dedup_res = uc.dedup.mcl(
    inflat_val=1.6,
    exp_val=2,
    iter_num=100,
    connected_components=ccs,
    graph_adj=graph_adj,
    verbose=True,
)

print(dedup_res.columns)
print("vertices: {}".format(dedup_res.loc[0, 'cc_vertices']))
print(dedup_res.loc[0, 'graph_cc_adj'])
print(dedup_res.loc[0, 'nt_to_int_map'])
print(dedup_res.loc[0, 'int_to_nt_map'])

It outputs some general information about the inputs and UMI graphs.

console

Index(['cc_vertices', 'graph_cc_adj', 'nt_to_int_map', 'int_to_nt_map',
       'cc_adj_mat', 'mcl_clusters', 'clusters', 'clust_num', 'edge_list',
       'apv'],
      dtype='object')
vertices: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
{0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'}

Then, we can see the MCL clustered results.

print(dedup_res.loc[0, 'cc_adj_mat'])
print(dedup_res.loc[0, 'mcl_clusters'])
print(dedup_res.loc[0, 'clusters'])
print(dedup_res.loc[0, 'clust_num'])

print(dedup_res['clusters'].values)
dedup_clusters_dc = decompose_mcl(list_nd=dedup_res['clusters'].values)
print("deduplicated clusters decomposed (mcl):\n{}".format(dedup_clusters_dc))

We can visually get a feeling of the three nodes 'A', 'B', 'C' being represented in a small cluster while 'D', 'E', 'F', 'G' being represented in another small cluster from its adjacency matrix dedup_res.loc[0, 'cc_adj_mat'].

console

[[0. 1. 1. 1. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 1. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 1.]
 [0. 0. 0. 0. 1. 1. 0.]]
[(0, 1, 2), (3, 4, 5, 6)]
[['A', 'B', 'C'], ['D', 'E', 'F', 'G']]
2
[list([['A', 'B', 'C'], ['D', 'E', 'F', 'G']])]
deduplicated clusters decomposed (mcl):
{0: ['A', 'B', 'C'], 1: ['D', 'E', 'F', 'G']}

Deduplicated UMI count

There are 2 connected subcomponents (0 and 1) and therefore, the deduplicated count of UMIs from 7 unique UMIs is 2 at the single locus.

`MCL-val`¶

There exist edges between representative nodes from different Markov clusters if their count difference is roughly within t-fold, yielding a UMI count-based merging strategy (referred to as MCL-val).

Python

import umiche as uc

df_mcl_val = uc.dedup.mcl_val(
    df_mcl_ccs=dedup_res,
    df_umi_uniq_val_cnt=node_val_sorted,
    thres_fold=2,
)
print(df_mcl_val)
dedup_count = df_mcl_val['count'].values[0]
dedup_clusters = df_mcl_val['clusters'].values[0]
print("deduplicated count (mcl_val):\n{}".format(dedup_count))
print("deduplicated clusters (mcl_val):\n{}".format(dedup_clusters))

df_mcl_val = decompose_mcl(list_nd=df_mcl_val['clusters'].values)
print("deduplicated clusters decomposed (mcl_val):\n{}".format(df_mcl_val))

Setting thres_fold as 2, we get the results that MCL-val keeps two clusters separated.

console

{'count': 0    2
Name: mscmv_val_len, dtype: int64, 'clusters': 0    [[A], [D]]
Name: mscmv_val_clusters, dtype: object, 'apv': 0    []
Name: mscmv_val_apv, dtype: object, 'disapv': 0    [[A, D]]
Name: mscmv_val_disapv, dtype: object}
deduplicated count (mcl_val):
2
deduplicated clusters (mcl_val):
[['A'], ['D']]
deduplicated clusters decomposed (mcl_val):
{0: ['A'], 1: ['D']}

However, if we change Setting thres_fold as 1, we get a new UMI deduplication result, 1.

Python

{'count': 0    1
Name: mscmv_val_len, dtype: int64, 'clusters': 0    [[A, D]]
Name: mscmv_val_clusters, dtype: object, 'apv': 0    [[A, D]]
Name: mscmv_val_apv, dtype: object, 'disapv': 0    []
Name: mscmv_val_disapv, dtype: object}
deduplicated count (mcl_val):
1
deduplicated clusters (mcl_val):
[['A', 'D']]
deduplicated clusters decomposed (mcl_val):
{0: ['A', 'D']}

`MCL-ed`¶

There exist edges between representative nodes from different Markov clusters if they are within a minimal edit distance k, leading to a distance-based merging strategy (referred to as MCL-ed). This method needs UMI sequences as input because it will measure the edit distance between UMIs. Thus, we generate the sequences of the 7 unique UMIs.

int_to_umi_dict = {
    'A': 'AGATCTCGCA',
    'B': 'AGATCCCGCA',
    'C': 'AGATCACGCA',
    'D': 'AGATCGCGCA',
    'E': 'AGATCGCGGA',
    'F': 'AGATCGCGTA',
    'G': 'TGATCGCGAA',
}

A threshold of edit distance can be set through thres_fold. In this case, we set it as 1, which means that if the representative UMI in each cluster is >1 edit distance away from the other one, they will be considered as two PCR-amplified UMIs originating from the same root.

Then, we can use uc.dedup.mcl_ed to obtain the edit distance-based collapsing results.

Python

import umiche as uc

df_mcl_ed = uc.dedup.mcl_ed(
    df_mcl_ccs=dedup_res,
    df_umi_uniq_val_cnt=node_val_sorted,
    thres_fold=1,
    int_to_umi_dict=int_to_umi_dict,
)
dedup_count = df_mcl_ed['count']
dedup_clusters = df_mcl_ed['clusters']
print('approval: {}'.format(df_mcl_ed['apv']))

print("deduplicated count (mcl_ed):\n{}".format(dedup_count))
print("deduplicated clusters (mcl_ed):\n{}".format(dedup_clusters))

df_mcl_ed = decompose_mcl(list_nd=df_mcl_ed['clusters'].values)
print("deduplicated clusters decomposed (mcl_ed):\n{}".format(df_mcl_ed))

By setting thres_fold as 1, MCL-ed merges two Markov clusters.

console

approval: 0    [[A, D]]
Name: mscmv_ed_apv, dtype: object
deduplicated count (mcl_ed):
0    1
Name: mscmv_ed_len, dtype: int64
deduplicated clusters (mcl_ed):
0    [[A, D]]
Name: mscmv_ed_clusters, dtype: object
deduplicated clusters decomposed (mcl_ed):
{0: ['A', 'D']}

However, if we change Setting thres_fold as 0, the two clusters are reckoned differently.

Python

approval: 0    []
Name: mscmv_ed_apv, dtype: object
deduplicated count (mcl_ed):
0    2
Name: mscmv_ed_len, dtype: int64
deduplicated clusters (mcl_ed):
0    [[A], [D]]
Name: mscmv_ed_clusters, dtype: object
deduplicated clusters decomposed (mcl_ed):
{0: ['A'], 1: ['D']}

Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology [Internet]. New York, NY, USA: Association for Computing Machinery; 2010. p. 247–56. Available from: https://doi.org/10.1145/1854776.1854812 ↩

6. Markov clustering

MCL¶

MCL-val¶

MCL-ed¶

`MCL`¶

`MCL-val`¶

`MCL-ed`¶