9. Afinity propagation

Affinity Propagation¹ is a clustering technique that selects representative exemplars from a set of data points and forms clusters around them. Unlike traditional methods like k-means, which require the number of clusters to be predetermined, Affinity Propagation automatically determines the number of clusters based on the data. This method is especially advantageous for discovering clusters without specifying their count and is compatible with various distance metrics, including non-Euclidean ones.

Our implemented affinity propagation clustering of UMIs is a parameter-free method, which also takes UMI sequences as input.

int_to_umi_dict = {
    'A': 'AGATCTCGCA',
    'B': 'AGATCCCGCA',
    'C': 'AGATCACGCA',
    'D': 'AGATCGCGCA',
    'E': 'AGATCGCGGA',
    'F': 'AGATCGCGTA',
    'G': 'TGATCGCGAA',
}

We can use uc.dedup.affinity_propagation to perform the collapsing of UMIs.

Python

import umiche as uc

df = uc.dedup.affinity_propagation(
    connected_components=ccs,
    graph_adj=graph_adj,
    int_to_umi_dict=int_to_umi_dict,
)
# print(df)
for i, col in enumerate(df.columns):
    print("{}.{}: \n{}".format(i + 1, col, df[col].values[0]))
df_decomposed = decompose_mcl(list_nd=df['clusters'].values)
print("deduplicated clusters decomposed:\n{}".format(df_decomposed))

console

1.cc_vertices: 
['A', 'B', 'C', 'D', 'E', 'F', 'G']
2.umi: 
['AGATCTCGCA', 'AGATCCCGCA', 'AGATCACGCA', 'AGATCGCGCA', 'AGATCGCGGA', 'AGATCGCGTA', 'TGATCGCGAA']
3.onehot: 
[array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]), array([0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])]
4.clustering_clusters: 
[array([0, 1, 0, 1, 2, 2, 3])]
5.clusters: 
[['A', 'C'], ['B', 'D'], ['E', 'F'], ['G']]
6.clust_num: 
4
7.graph_cc_adj: 
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
8.edge_list: 
[('G', 'F'), ('B', 'A'), ('F', 'D'), ('C', 'A'), ('E', 'D'), ('C', 'B'), ('G', 'E'), ('D', 'A')]
9.apv: 
[['G', 'F'], ['B', 'A'], ['F', 'D'], ['C', 'A'], ['E', 'D'], ['C', 'B'], ['G', 'E'], ['D', 'A']]
deduplicated clusters decomposed:
{0: ['A', 'C'], 1: ['B', 'D'], 2: ['E', 'F'], 3: ['G']}

Deduplicated UMI count

There are 4 connected subcomponents after running and therefore, the deduplicated count of UMIs from 7 unique UMIs is 4 at the single locus.

Shang F, Jiao LC, Shi J, Wang F, Gong M. Fast affinity propagation clustering: A multilevel approach. Pattern Recognit [Internet]. 2012;45:474–86. Available from: https://www.sciencedirect.com/science/article/pii/S0031320311002007 ↩