7. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)1 is a widely-used clustering algorithm in data mining and machine learning, designed to identify clusters in a dataset. Unlike traditional methods like k-means, which require the number of clusters to be specified beforehand, DBSCAN is a density-based algorithm capable of discovering clusters of any shape and effectively handling noise and outliers.
We implemented a DBSCAN-based UMI deduplication method taking one-hot encoded UMI representation as input, which can be accessed via uc.dedup.dbscan
. Thus, it also needs the information on UMI sequences.
int_to_umi_dict = {
'A': 'AGATCTCGCA',
'B': 'AGATCCCGCA',
'C': 'AGATCACGCA',
'D': 'AGATCGCGCA',
'E': 'AGATCGCGGA',
'F': 'AGATCGCGTA',
'G': 'TGATCGCGAA',
}
For UMI deduplication, we can use the code below.
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
Parameter illustration
dbscan_eps
: The maximum distance that defines the neighborhood of a sample; two samples are considered neighbors if the distance between them is less than or equal to this value.
dbscan_min_spl
: The minimum number of samples (or total weight) required within a neighborhood for a point to qualify as a core point.
From the output, the one-hot representation can be seen. For example, AGATCTCGCA
is encoded as [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
. All sequences are encoded as
[[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]]
console
1.cc_vertices:
['A', 'B', 'C', 'D', 'E', 'F', 'G']
2.umi:
['AGATCTCGCA', 'AGATCCCGCA', 'AGATCACGCA', 'AGATCGCGCA', 'AGATCGCGGA', 'AGATCGCGTA', 'TGATCGCGAA']
3.onehot:
[array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]), array([0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])]
4.clustering_clusters:
[array([0, 0, 0, 0, 0, 0, 1])]
5.clusters:
[['A', 'B', 'C', 'D', 'E', 'F'], ['G']]
6.clust_num:
2
7.graph_cc_adj:
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
8.edge_list:
[('D', 'A'), ('F', 'D'), ('C', 'B'), ('G', 'F'), ('B', 'A'), ('C', 'A'), ('E', 'D'), ('G', 'E')]
9.apv:
[['D', 'A'], ['F', 'D'], ['C', 'B'], ['G', 'F'], ['B', 'A'], ['C', 'A'], ['E', 'D'], ['G', 'E']]
deduplicated clusters decomposed:
{0: ['A', 'B', 'C', 'D', 'E', 'F'], 1: ['G']}
Deduplicated UMI count
There are 2 connected subcomponents (0 and 1) and therefore, the deduplicated count of UMIs from 7 unique UMIs is 2 at the single locus.
-
Schubert E, Sander J, Ester M, Kriegel HP, Xu X. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Trans Database Syst [Internet]. 2017;42. Available from: https://doi.org/10.1145/3068335 ↩