Skip to content

2. Majority vote

Majority vote function

The Python code implemented to perform the majority vote for UMI deduplication is shown below.

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
def track(
        multimer_list,
        recur_len,
):
    multimer_umi_to_mono_umi_map = {multimer_umi: self.collapse.majority_vote(
        umi=multimer_umi,
        recur_len=recur_len,
    ) for multimer_umi in multimer_list}
    mono_umi_to_multimer_umi_map = {self.collapse.majority_vote(
        umi=multimer_umi,
        recur_len=recur_len,
    ): multimer_umi for multimer_umi in multimer_list}
    uniq_multimer_cnt = len(multimer_umi_to_mono_umi_map)
    shortlisted_multimer_umi_list = [*mono_umi_to_multimer_umi_map.values()]
    dedup_cnt = len(shortlisted_multimer_umi_list)
    print('=========># of shortlisted multimer UMIs: {}'.format(len(shortlisted_multimer_umi_list)))
    print('=========>dedup cnt: {}'.format(dedup_cnt))
    return dedup_cnt, uniq_multimer_cnt, shortlisted_multimer_umi_list

UMI deduplication

We use a BAM file containing reads demarked with trimer UMIs. It contains a total of 6949 reads observed at a single locus. To read it, we do

Python

1
2
3
4
5
6
7
8
import umiche as uc

bam = uc.io.read_bam(
    bam_fpn="/mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam",
    verbose=True,
)
df_bam = bam.todf(tags=['PO'])
print(df_bam)

console

30/07/2024 20:28:17 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam
30/07/2024 20:28:17 logger: ===>reading BAM time: 0.00s
30/07/2024 20:28:17 logger: =========>start converting bam to df...
30/07/2024 20:28:17 logger: =========>time to df: 0.030s
        id  ... PO
0        0  ...  1
1        1  ...  1
2        2  ...  1
3        3  ...  1
4        4  ...  1
...    ...  ... ..
6944  6944  ...  1
6945  6945  ...  1
6946  6946  ...  1
6947  6947  ...  1
6948  6948  ...  1

[6949 rows x 13 columns]

Then, we extract only trimer sequences from them.

Python

1
2
trimer_list = df_bam.query_name.apply(lambda x: x.split('_')[1]).values
print(trimer_list)

console

['GGGTTTGTGACCCCCTGTAAATTTCCCCGGAAAGTG'
 'GGGAAATTTTTTGTTCTCAAAGGGCAAGGGAAATTT'
 'TTTGGGAACAAAGGGTTTAGGTTTCGGAAAAAATTT' ...
 'GGGAAAAAAGGGAACAGATATAAATTTTTTTTTCCC'
 'TTTATTAAAGGAAAATTAGGGAAACTTTTTAAATTT'
 'AAAGGGAAACCCAAATTTGGGTTTTCGTTTCCTTTT']

Using the trimers as input, we can perform deduplication with set cover.

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
import umiche as uc

(
    dedup_cnt,
    uniq_multimer_cnt,
    shortlisted_multimer_umi_list,
) = uc.dedup.majority_vote(
    multimer_list=trimer_list,
    recur_len=3,
    verbose=True
)
print(dedup_cnt)

console

30/07/2024 20:49:14 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam
30/07/2024 20:49:14 logger: ===>reading BAM time: 0.00s
30/07/2024 20:49:14 logger: =========>start converting bam to df...
30/07/2024 20:49:14 logger: =========>time to df: 0.059s
30/07/2024 20:49:14 logger: =========># of shortlisted multimer UMIs: 1501
30/07/2024 20:49:14 logger: =========>dedup cnt: 1501

The shortlisted_multimer_umi_list contains all trimer UMIs left after deduplication.

['GGGTTTGGGCCCCCCTTTGAATTTACCCGGAAAGGG',
 'AGGAAATTCTTTTCTCCCAAAGGGAAAGGGAAATTT', 
 'TTTGGGAAAAAAGGGTTAGGGTTTGGGAAAAAATTT', 
 'TTTGGGAAAAAAAAAAAAGGAGGCAAACCCGGGTTT', 
 'AGGCCCGGGAAAAAAGTGAAATGGGGCAAAGGAGGG', 
 'TTTCCCCACTTTCACTTTAAAGGATTTGGGCCCCCC', 
 'TGTCCCCCCAAATTTCCACCCAAACTATTTGGGCTC',
...
]

Deduplicated UMI count

The result shows that 6949 reads are deduplicated as 1501.