8. BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)¹ is a hierarchical clustering algorithm optimized for efficiently managing large datasets. Its scalable and incremental design enables it to process data in a single pass and adapt dynamically to new data without needing to reprocess the entire dataset. This makes it particularly effective for clustering large volumes of data.

Feature

BIRCH constructs a Clustering Feature Tree (CF Tree) to efficiently represent the dataset. This tree comprises nodes, each containing a concise summary of the data, referred to as Clustering Features (CFs).

We can use uc.dedup.birch to perform the collapsing of UMIs using UMI sequences as input.

int_to_umi_dict = {
    'A': 'AGATCTCGCA',
    'B': 'AGATCCCGCA',
    'C': 'AGATCACGCA',
    'D': 'AGATCGCGCA',
    'E': 'AGATCGCGGA',
    'F': 'AGATCGCGTA',
    'G': 'TGATCGCGAA',
}

For UMI deduplication, we can use the code below.

Python

import umiche as uc

df = uc.dedup.birch(
    connected_components=ccs,
    graph_adj=graph_adj,
    int_to_umi_dict=int_to_umi_dict,
    birch_thres=1.8,
    birch_n_clusters=None,
)
# print(df)
for i, col in enumerate(df.columns):
    print("{}.{}: \n{}".format(i + 1, col, df[col].values[0]))
df_decomposed = decompose_mcl(list_nd=df['clusters'].values)
print("deduplicated clusters decomposed:\n{}".format(df_decomposed))

Parameter illustration

birch_thres: The threshold radius determines that the subcluster formed by merging a new sample with the nearest subcluster must not exceed this value.

birch_n_clusters: This parameter specifies the number of clusters to be formed during the final clustering step, where the subclusters at the leaves are considered as individual samples.

console

1.cc_vertices: 
['A', 'B', 'C', 'D', 'E', 'F', 'G']
2.umi: 
['AGATCTCGCA', 'AGATCCCGCA', 'AGATCACGCA', 'AGATCGCGCA', 'AGATCGCGGA', 'AGATCGCGTA', 'TGATCGCGAA']
3.onehot: 
[array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]), array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]), array([0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])]
4.clustering_clusters: 
[array([0, 0, 0, 0, 0, 0, 0])]
5.clusters: 
[['A', 'B', 'C', 'D', 'E', 'F', 'G']]
6.clust_num: 
1
7.graph_cc_adj: 
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
8.edge_list: 
[('C', 'B'), ('F', 'D'), ('G', 'F'), ('B', 'A'), ('D', 'A'), ('E', 'D'), ('C', 'A'), ('G', 'E')]
9.apv: 
[['C', 'B'], ['F', 'D'], ['G', 'F'], ['B', 'A'], ['D', 'A'], ['E', 'D'], ['C', 'A'], ['G', 'E']]
deduplicated clusters decomposed:
{0: ['A', 'B', 'C', 'D', 'E', 'F', 'G']}

Deduplicated UMI count

There is 1 connected subcomponent containing all the 7 UMIs (A, B, C, D, E, F, G) and therefore, the deduplicated count of UMIs is 1 at the single locus.

Zhang T, Ramakrishnan R, Livny M. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Min Knowl Discov [Internet]. 1997;1:141–82. Available from: https://doi.org/10.1023/A:1009783824328 ↩