Skip to content

6. Markov clustering

We developed mclUMI for automatic detection of UMI clusters by the Markov clustering algorithm1 without the need for calculating UMI counts, leading to the MCL method. It has two derivatives (MCL-ed and MCL-val) by considering the information about UMI counts.

MCL

The MCL method collapses UMIs without the information about the count of UMIs.

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
import umiche as uc

ccs = uc.dedup.cluster(
    graph=graph_adj,
    method='deque',
)

dedup_res = uc.dedup.mcl(
    inflat_val=1.6,
    exp_val=2,
    iter_num=100,
    connected_components=ccs,
    graph_adj=graph_adj,
    verbose=True,
)

print(dedup_res.columns)
print("vertices: {}".format(dedup_res.loc[0, 'cc_vertices']))
print(dedup_res.loc[0, 'graph_cc_adj'])
print(dedup_res.loc[0, 'nt_to_int_map'])
print(dedup_res.loc[0, 'int_to_nt_map'])

It outputs some general information about the inputs and UMI graphs.

console

Index(['cc_vertices', 'graph_cc_adj', 'nt_to_int_map', 'int_to_nt_map',
       'cc_adj_mat', 'mcl_clusters', 'clusters', 'clust_num', 'edge_list',
       'apv'],
      dtype='object')
vertices: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
{0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'}

Then, we can see the MCL clustered results.

1
2
3
4
5
6
7
8
print(dedup_res.loc[0, 'cc_adj_mat'])
print(dedup_res.loc[0, 'mcl_clusters'])
print(dedup_res.loc[0, 'clusters'])
print(dedup_res.loc[0, 'clust_num'])

print(dedup_res['clusters'].values)
dedup_clusters_dc = decompose_mcl(list_nd=dedup_res['clusters'].values)
print("deduplicated clusters decomposed (mcl):\n{}".format(dedup_clusters_dc))

We can visually get a feeling of the three nodes 'A', 'B', 'C' being represented in a small cluster while 'D', 'E', 'F', 'G' being represented in another small cluster from its adjacency matrix dedup_res.loc[0, 'cc_adj_mat'].

console

[[0. 1. 1. 1. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 1. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 1.]
 [0. 0. 0. 0. 1. 1. 0.]]
[(0, 1, 2), (3, 4, 5, 6)]
[['A', 'B', 'C'], ['D', 'E', 'F', 'G']]
2
[list([['A', 'B', 'C'], ['D', 'E', 'F', 'G']])]
deduplicated clusters decomposed (mcl):
{0: ['A', 'B', 'C'], 1: ['D', 'E', 'F', 'G']}

Deduplicated UMI count

There are 2 connected subcomponents (0 and 1) and therefore, the deduplicated count of UMIs from 7 unique UMIs is 2 at the single locus.

MCL-val

There exist edges between representative nodes from different Markov clusters if their count difference is roughly within t-fold, yielding a UMI count-based merging strategy (referred to as MCL-val).

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
import umiche as uc

df_mcl_val = uc.dedup.mcl_val(
    df_mcl_ccs=dedup_res,
    df_umi_uniq_val_cnt=node_val_sorted,
    thres_fold=2,
)
print(df_mcl_val)
dedup_count = df_mcl_val['count'].values[0]
dedup_clusters = df_mcl_val['clusters'].values[0]
print("deduplicated count (mcl_val):\n{}".format(dedup_count))
print("deduplicated clusters (mcl_val):\n{}".format(dedup_clusters))

df_mcl_val = decompose_mcl(list_nd=df_mcl_val['clusters'].values)
print("deduplicated clusters decomposed (mcl_val):\n{}".format(df_mcl_val))

Setting thres_fold as 2, we get the results that MCL-val keeps two clusters separated.

console

{'count': 0    2
Name: mscmv_val_len, dtype: int64, 'clusters': 0    [[A], [D]]
Name: mscmv_val_clusters, dtype: object, 'apv': 0    []
Name: mscmv_val_apv, dtype: object, 'disapv': 0    [[A, D]]
Name: mscmv_val_disapv, dtype: object}
deduplicated count (mcl_val):
2
deduplicated clusters (mcl_val):
[['A'], ['D']]
deduplicated clusters decomposed (mcl_val):
{0: ['A'], 1: ['D']}

However, if we change Setting thres_fold as 1, we get a new UMI deduplication result, 1.

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
{'count': 0    1
Name: mscmv_val_len, dtype: int64, 'clusters': 0    [[A, D]]
Name: mscmv_val_clusters, dtype: object, 'apv': 0    [[A, D]]
Name: mscmv_val_apv, dtype: object, 'disapv': 0    []
Name: mscmv_val_disapv, dtype: object}
deduplicated count (mcl_val):
1
deduplicated clusters (mcl_val):
[['A', 'D']]
deduplicated clusters decomposed (mcl_val):
{0: ['A', 'D']}

MCL-ed

There exist edges between representative nodes from different Markov clusters if they are within a minimal edit distance k, leading to a distance-based merging strategy (referred to as MCL-ed). This method needs UMI sequences as input because it will measure the edit distance between UMIs. Thus, we generate the sequences of the 7 unique UMIs.

int_to_umi_dict = {
    'A': 'AGATCTCGCA',
    'B': 'AGATCCCGCA',
    'C': 'AGATCACGCA',
    'D': 'AGATCGCGCA',
    'E': 'AGATCGCGGA',
    'F': 'AGATCGCGTA',
    'G': 'TGATCGCGAA',
}

A threshold of edit distance can be set through thres_fold. In this case, we set it as 1, which means that if the representative UMI in each cluster is >1 edit distance away from the other one, they will be considered as two PCR-amplified UMIs originating from the same root.

Then, we can use uc.dedup.mcl_ed to obtain the edit distance-based collapsing results.

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
import umiche as uc

df_mcl_ed = uc.dedup.mcl_ed(
    df_mcl_ccs=dedup_res,
    df_umi_uniq_val_cnt=node_val_sorted,
    thres_fold=1,
    int_to_umi_dict=int_to_umi_dict,
)
dedup_count = df_mcl_ed['count']
dedup_clusters = df_mcl_ed['clusters']
print('approval: {}'.format(df_mcl_ed['apv']))

print("deduplicated count (mcl_ed):\n{}".format(dedup_count))
print("deduplicated clusters (mcl_ed):\n{}".format(dedup_clusters))

df_mcl_ed = decompose_mcl(list_nd=df_mcl_ed['clusters'].values)
print("deduplicated clusters decomposed (mcl_ed):\n{}".format(df_mcl_ed))

By setting thres_fold as 1, MCL-ed merges two Markov clusters.

console

approval: 0    [[A, D]]
Name: mscmv_ed_apv, dtype: object
deduplicated count (mcl_ed):
0    1
Name: mscmv_ed_len, dtype: int64
deduplicated clusters (mcl_ed):
0    [[A, D]]
Name: mscmv_ed_clusters, dtype: object
deduplicated clusters decomposed (mcl_ed):
{0: ['A', 'D']}

However, if we change Setting thres_fold as 0, the two clusters are reckoned differently.

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
approval: 0    []
Name: mscmv_ed_apv, dtype: object
deduplicated count (mcl_ed):
0    2
Name: mscmv_ed_len, dtype: int64
deduplicated clusters (mcl_ed):
0    [[A], [D]]
Name: mscmv_ed_clusters, dtype: object
deduplicated clusters decomposed (mcl_ed):
{0: ['A'], 1: ['D']}


  1. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology [Internet]. New York, NY, USA: Association for Computing Machinery; 2010. p. 247–56. Available from: https://doi.org/10.1145/1854776.1854812