8. BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)[1] is a hierarchical clustering algorithm designed to handle large datasets efficiently. Its scalable, incremental design lets it process data in a single pass and absorb new data without reprocessing the entire dataset, which makes it particularly effective for clustering large volumes of data.
Feature
BIRCH constructs a Clustering Feature Tree (CF Tree) to efficiently represent the dataset. This tree comprises nodes, each containing a concise summary of the data, referred to as Clustering Features (CFs).
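Each CF is the triple (N, LS, SS): the number of points summarized, their linear sum, and the sum of their squared norms. Because these summaries are additive, inserting a new point only updates the CFs along one root-to-leaf path, which is what allows single-pass, incremental clustering. The sketch below is a minimal illustration of such a summary; it is not uc's (or scikit-learn's) internal implementation.
Python
import numpy as np

# Minimal clustering feature (CF): the additive summary (N, LS, SS)
# maintained in each node of the CF tree. Illustrative only.
class ClusteringFeature:
    def __init__(self, dim):
        self.n = 0                  # number of points absorbed
        self.ls = np.zeros(dim)     # linear sum of the points
        self.ss = 0.0               # sum of squared norms of the points

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the points from the centroid,
        # computable from (N, LS, SS) alone.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

    def merge(self, other):
        # CFs are additive, so two subclusters merge in O(dim).
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss
This CF-derived radius is the quantity constrained by the radius threshold (birch_thres) described below.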
We can use uc.dedup.birch to perform the collapsing of UMIs, taking UMI sequences as input. The dictionary below maps node IDs to UMI sequences.
int_to_umi_dict = {
'A': 'AGATCTCGCA',
'B': 'AGATCCCGCA',
'C': 'AGATCACGCA',
'D': 'AGATCGCGCA',
'E': 'AGATCGCGGA',
'F': 'AGATCGCGTA',
'G': 'TGATCGCGAA',
}
For UMI deduplication, we can then call uc.dedup.birch with this dictionary and the parameters described below.
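Before clustering, each UMI is one-hot encoded into a flat binary vector of length 4 × L (see the onehot entry in the console output below). The standalone sketch below reproduces that encoding with plain NumPy; it is an illustration rather than the uc API, and the base order A, T, C, G is inferred from the vectors printed below.
Python
import numpy as np

# Node-ID -> UMI mapping, as defined above.
int_to_umi_dict = {
    'A': 'AGATCTCGCA', 'B': 'AGATCCCGCA', 'C': 'AGATCACGCA',
    'D': 'AGATCGCGCA', 'E': 'AGATCGCGGA', 'F': 'AGATCGCGTA',
    'G': 'TGATCGCGAA',
}

BASES = 'ATCG'  # base order inferred from the onehot output below

def onehot(umi):
    # One block of four positions per base; exactly one 1 per block.
    vec = np.zeros(4 * len(umi), dtype=int)
    for i, base in enumerate(umi):
        vec[4 * i + BASES.index(base)] = 1
    return vec

onehot_vectors = [onehot(seq) for seq in int_to_umi_dict.values()]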
Parameter illustration
birch_thres: The radius threshold. A new sample is merged into its nearest subcluster only if the radius of the merged subcluster does not exceed this value.
birch_n_clusters: The number of clusters produced by the final clustering step, in which the leaf subclusters are treated as individual samples.
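These two parameters correspond naturally to the threshold and n_clusters arguments of scikit-learn's Birch estimator. Continuing the sketch above, the code below clusters the one-hot vectors with scikit-learn's Birch as an illustration of this step; the mapping onto threshold and n_clusters and the example values are assumptions, not a description of uc's internals.
Python
from sklearn.cluster import Birch
import numpy as np

X = np.vstack(onehot_vectors)  # 7 x 40 matrix of one-hot encoded UMIs

# threshold  ~ birch_thres: max radius a subcluster may reach when a new
#              sample is merged into its nearest subcluster
# n_clusters ~ birch_n_clusters: number of clusters in the final global
#              clustering over the leaf subclusters (None skips this step)
model = Birch(threshold=1.8, n_clusters=None)  # example values, not uc defaults
labels = model.fit_predict(X)
print(labels)  # these closely related UMIs are expected to collapse into one cluster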
console
1.cc_vertices:
['A', 'B', 'C', 'D', 'E', 'F', 'G']
2.umi:
['AGATCTCGCA', 'AGATCCCGCA', 'AGATCACGCA', 'AGATCGCGCA', 'AGATCGCGGA', 'AGATCGCGTA', 'TGATCGCGAA']
3.onehot:
[array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]), array([0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])]
4.clustering_clusters:
[array([0, 0, 0, 0, 0, 0, 0])]
5.clusters:
[['A', 'B', 'C', 'D', 'E', 'F', 'G']]
6.clust_num:
1
7.graph_cc_adj:
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
8.edge_list:
[('C', 'B'), ('F', 'D'), ('G', 'F'), ('B', 'A'), ('D', 'A'), ('E', 'D'), ('C', 'A'), ('G', 'E')]
9.apv:
[['C', 'B'], ['F', 'D'], ['G', 'F'], ['B', 'A'], ['D', 'A'], ['E', 'D'], ['C', 'A'], ['G', 'E']]
deduplicated clusters decomposed:
{0: ['A', 'B', 'C', 'D', 'E', 'F', 'G']}
Deduplicated UMI count
There is 1 connected component containing all 7 UMIs (A, B, C, D, E, F, and G); therefore, the deduplicated UMI count at this single locus is 1.
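To make this concrete, the graph_cc_adj adjacency printed above can be traversed directly. The sketch below is a standalone check rather than the uc API: it counts the connected components and hence the deduplicated UMI count.
Python
# Adjacency list taken from the graph_cc_adj output above.
graph_cc_adj = {
    'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'],
    'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F'],
}

def connected_components(adj):
    # Iterative depth-first search over the adjacency list.
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                comp.append(node)
                stack.extend(adj[node])
        components.append(sorted(comp))
    return components

components = connected_components(graph_cc_adj)
print(components)       # [['A', 'B', 'C', 'D', 'E', 'F', 'G']]
print(len(components))  # 1 -> deduplicated UMI count at this locus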
1. Zhang T, Ramakrishnan R, Livny M. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Min Knowl Discov. 1997;1:141–82. Available from: https://doi.org/10.1023/A:1009783824328