6. Markov clustering
We developed mclUMI for automatic detection of UMI clusters by the Markov clustering algorithm1 without the need for calculating UMI counts, leading to the MCL method. It has two derivatives (MCL-ed and MCL-val) by considering the information about UMI counts.
MCL
¶
The MCL
method collapses UMIs without the information about the count of UMIs.
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
It outputs some general information about the inputs and UMI graphs.
console
Index(['cc_vertices', 'graph_cc_adj', 'nt_to_int_map', 'int_to_nt_map',
'cc_adj_mat', 'mcl_clusters', 'clusters', 'clust_num', 'edge_list',
'apv'],
dtype='object')
vertices: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
{0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'}
Then, we can see the MCL clustered results.
1 2 3 4 5 6 7 8 |
|
We can visually get a feeling of the three nodes 'A', 'B', 'C' being represented in a small cluster while 'D', 'E', 'F', 'G' being represented in another small cluster from its adjacency matrix dedup_res.loc[0, 'cc_adj_mat']
.
console
[[0. 1. 1. 1. 0. 0. 0.]
[1. 0. 1. 0. 0. 0. 0.]
[1. 1. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 1. 1. 0.]
[0. 0. 0. 1. 0. 0. 1.]
[0. 0. 0. 1. 0. 0. 1.]
[0. 0. 0. 0. 1. 1. 0.]]
[(0, 1, 2), (3, 4, 5, 6)]
[['A', 'B', 'C'], ['D', 'E', 'F', 'G']]
2
[list([['A', 'B', 'C'], ['D', 'E', 'F', 'G']])]
deduplicated clusters decomposed (mcl):
{0: ['A', 'B', 'C'], 1: ['D', 'E', 'F', 'G']}
Deduplicated UMI count
There are 2 connected subcomponents (0 and 1) and therefore, the deduplicated count of UMIs from 7 unique UMIs is 2 at the single locus.
MCL-val
¶
There exist edges between representative nodes from different Markov clusters if their count difference is roughly within t-fold, yielding a UMI count-based merging strategy (referred to as MCL-val).
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
Setting thres_fold
as 2, we get the results that MCL-val keeps two clusters separated.
console
{'count': 0 2
Name: mscmv_val_len, dtype: int64, 'clusters': 0 [[A], [D]]
Name: mscmv_val_clusters, dtype: object, 'apv': 0 []
Name: mscmv_val_apv, dtype: object, 'disapv': 0 [[A, D]]
Name: mscmv_val_disapv, dtype: object}
deduplicated count (mcl_val):
2
deduplicated clusters (mcl_val):
[['A'], ['D']]
deduplicated clusters decomposed (mcl_val):
{0: ['A'], 1: ['D']}
However, if we change Setting thres_fold
as 1, we get a new UMI deduplication result, 1.
Python
1 2 3 4 5 6 7 8 9 10 11 |
|
MCL-ed
¶
There exist edges between representative nodes from different Markov clusters if they are within a minimal edit distance k, leading to a distance-based merging strategy (referred to as MCL-ed). This method needs UMI sequences as input because it will measure the edit distance between UMIs. Thus, we generate the sequences of the 7 unique UMIs.
int_to_umi_dict = {
'A': 'AGATCTCGCA',
'B': 'AGATCCCGCA',
'C': 'AGATCACGCA',
'D': 'AGATCGCGCA',
'E': 'AGATCGCGGA',
'F': 'AGATCGCGTA',
'G': 'TGATCGCGAA',
}
A threshold of edit distance can be set through thres_fold
. In this case, we set it as 1, which means that if the representative UMI in each cluster is >1 edit distance away from the other one, they will be considered as two PCR-amplified UMIs originating from the same root.
Then, we can use uc.dedup.mcl_ed
to obtain the edit distance-based collapsing results.
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
By setting thres_fold
as 1, MCL-ed
merges two Markov clusters.
console
approval: 0 [[A, D]]
Name: mscmv_ed_apv, dtype: object
deduplicated count (mcl_ed):
0 1
Name: mscmv_ed_len, dtype: int64
deduplicated clusters (mcl_ed):
0 [[A, D]]
Name: mscmv_ed_clusters, dtype: object
deduplicated clusters decomposed (mcl_ed):
{0: ['A', 'D']}
However, if we change Setting thres_fold
as 0, the two clusters are reckoned differently.
Python
1 2 3 4 5 6 7 8 9 10 |
|
-
Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology [Internet]. New York, NY, USA: Association for Computing Machinery; 2010. p. 247–56. Available from: https://doi.org/10.1145/1854776.1854812 ↩