7. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)¹ is a widely-used clustering algorithm in data mining and machine learning, designed to identify clusters in a dataset. Unlike traditional methods like k-means, which require the number of clusters to be specified beforehand, DBSCAN is a density-based algorithm capable of discovering clusters of any shape and effectively handling noise and outliers.

We implemented a DBSCAN-based UMI deduplication method taking one-hot encoded UMI representation as input, which can be accessed via uc.dedup.dbscan. Thus, it also needs the information on UMI sequences.

int_to_umi_dict = {
    'A': 'AGATCTCGCA',
    'B': 'AGATCCCGCA',
    'C': 'AGATCACGCA',
    'D': 'AGATCGCGCA',
    'E': 'AGATCGCGGA',
    'F': 'AGATCGCGTA',
    'G': 'TGATCGCGAA',
}

For UMI deduplication, we can use the code below.

Python

import umiche as uc

df = uc.dedup.dbscan(
    connected_components=ccs,
    graph_adj=graph_adj,
    # df_umi_uniq_val_cnt=node_val_sorted,
    int_to_umi_dict=int_to_umi_dict,
    dbscan_eps=1.5,
    dbscan_min_spl=1,
)
for i, col in enumerate(df.columns):
    print("{}.{}: \n{}".format(i+1, col, df[col].values[0]))
df_decomposed = decompose_mcl(list_nd=df['clusters'].values)
print("deduplicated clusters decomposed:\n{}".format(df_decomposed))

Parameter illustration

dbscan_eps: The maximum distance that defines the neighborhood of a sample; two samples are considered neighbors if the distance between them is less than or equal to this value.

dbscan_min_spl: The minimum number of samples (or total weight) required within a neighborhood for a point to qualify as a core point.

From the output, the one-hot representation can be seen. For example, AGATCTCGCA is encoded as [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]. All sequences are encoded as

[[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]]

console

1.cc_vertices: 
['A', 'B', 'C', 'D', 'E', 'F', 'G']
2.umi: 
['AGATCTCGCA', 'AGATCCCGCA', 'AGATCACGCA', 'AGATCGCGCA', 'AGATCGCGGA', 'AGATCGCGTA', 'TGATCGCGAA']
3.onehot: 
[array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]), array([0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])]
4.clustering_clusters: 
[array([0, 0, 0, 0, 0, 0, 1])]
5.clusters: 
[['A', 'B', 'C', 'D', 'E', 'F'], ['G']]
6.clust_num: 
2
7.graph_cc_adj: 
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
8.edge_list: 
[('D', 'A'), ('F', 'D'), ('C', 'B'), ('G', 'F'), ('B', 'A'), ('C', 'A'), ('E', 'D'), ('G', 'E')]
9.apv: 
[['D', 'A'], ['F', 'D'], ['C', 'B'], ['G', 'F'], ['B', 'A'], ['C', 'A'], ['E', 'D'], ['G', 'E']]
deduplicated clusters decomposed:
{0: ['A', 'B', 'C', 'D', 'E', 'F'], 1: ['G']}

Deduplicated UMI count

There are 2 connected subcomponents (0 and 1) and therefore, the deduplicated count of UMIs from 7 unique UMIs is 2 at the single locus.

Schubert E, Sander J, Ester M, Kriegel HP, Xu X. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Trans Database Syst [Internet]. 2017;42. Available from: https://doi.org/10.1145/3068335 ↩