2. Majority vote
Majority vote function
The Python code implemented to perform the majority vote for UMI deduplication is shown below.
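As a minimal sketch of the idea (not the umiche implementation), a trimer UMI can be collapsed by taking the most common base within each block of three. The function name `collapse_trimer_umi` and the tie-breaking rule (fall back to the first base when all three differ) are assumptions made for illustration.

```python
from collections import Counter


def collapse_trimer_umi(umi: str) -> str:
    """Collapse a trimer UMI to a monomer UMI by per-block majority vote."""
    if len(umi) % 3 != 0:
        raise ValueError("trimer UMI length must be a multiple of 3")
    collapsed = []
    for i in range(0, len(umi), 3):
        block = umi[i:i + 3]
        base, count = Counter(block).most_common(1)[0]
        # Assumption: three different bases give no majority,
        # so fall back to the first base of the block.
        collapsed.append(base if count >= 2 else block[0])
    return "".join(collapsed)


# A 36-nt trimer UMI collapses to a 12-nt monomer UMI.
print(collapse_trimer_umi("GGGTTTGTGACCCCCTGTAAATTTCCCCGGAAAGTG"))
# GTGCCTATCGAG
```

A single substitution inside a block (for example `GTG` or `ACC`) is outvoted by the two intact bases, which is why trimer UMIs tolerate isolated sequencing errors.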
UMI deduplication
We use a BAM file containing reads tagged with trimer UMIs. It holds a total of 6949 reads observed at a single locus. We read it as follows.
console
30/07/2024 20:28:17 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam
30/07/2024 20:28:17 logger: ===>reading BAM time: 0.00s
30/07/2024 20:28:17 logger: =========>start converting bam to df...
30/07/2024 20:28:17 logger: =========>time to df: 0.030s
id ... PO
0 0 ... 1
1 1 ... 1
2 2 ... 1
3 3 ... 1
4 4 ... 1
... ... ... ..
6944 6944 ... 1
6945 6945 ... 1
6946 6946 ... 1
6947 6947 ... 1
6948 6948 ... 1
[6949 rows x 13 columns]
Then, we extract only the trimer UMI sequences from the reads.
console
['GGGTTTGTGACCCCCTGTAAATTTCCCCGGAAAGTG'
'GGGAAATTTTTTGTTCTCAAAGGGCAAGGGAAATTT'
'TTTGGGAACAAAGGGTTTAGGTTTCGGAAAAAATTT' ...
'GGGAAAAAAGGGAACAGATATAAATTTTTTTTTCCC'
'TTTATTAAAGGAAAATTAGGGAAACTTTTTAAATTT'
'AAAGGGAAACCCAAATTTGGGTTTTCGTTTCCTTTT']
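The extraction step can be sketched in plain Python. This is an illustration only: it assumes, hypothetically, that each read name ends with its UMI after an underscore, and the helper `extract_umi` is not part of umiche. The read-name layout of the actual BAM file may differ.

```python
def extract_umi(read_name: str, sep: str = "_") -> str:
    """Return the UMI, assumed to be the last `sep`-separated token of the read name."""
    return read_name.rsplit(sep, 1)[-1]


# Hypothetical read names, each carrying a short trimer UMI as its suffix.
read_names = ["read_0_GGGTTTAAA", "read_1_GGGTTTAAC"]
trimers = [extract_umi(name) for name in read_names]
print(trimers)
# ['GGGTTTAAA', 'GGGTTTAAC']
```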
Using the trimers as input, we can perform deduplication with majority vote.
console
30/07/2024 20:49:14 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam
30/07/2024 20:49:14 logger: ===>reading BAM time: 0.00s
30/07/2024 20:49:14 logger: =========>start converting bam to df...
30/07/2024 20:49:14 logger: =========>time to df: 0.059s
30/07/2024 20:49:14 logger: =========># of shortlisted multimer UMIs: 1501
30/07/2024 20:49:14 logger: =========>dedup cnt: 1501
The shortlisted_multimer_umi_list contains all trimer UMIs left after deduplication.
['GGGTTTGGGCCCCCCTTTGAATTTACCCGGAAAGGG',
'AGGAAATTCTTTTCTCCCAAAGGGAAAGGGAAATTT',
'TTTGGGAAAAAAGGGTTAGGGTTTGGGAAAAAATTT',
'TTTGGGAAAAAAAAAAAAGGAGGCAAACCCGGGTTT',
'AGGCCCGGGAAAAAAGTGAAATGGGGCAAAGGAGGG',
'TTTCCCCACTTTCACTTTAAAGGATTTGGGCCCCCC',
'TGTCCCCCCAAATTTCCACCCAAACTATTTGGGCTC',
...
]
Deduplicated UMI count
The result shows that the 6949 reads are deduplicated to 1501 UMIs.
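One way to picture how reads shrink to a shortlist of trimer UMIs is to group the trimers by their majority-vote collapsed sequence and keep one representative per group. The sketch below works under that assumption; the names `majority_base` and `dedup_by_majority_vote`, the choice of the first-seen trimer as representative, and the tie-breaking rule are all hypothetical and may differ from what umiche does internally.

```python
from collections import Counter


def majority_base(block: str) -> str:
    """Most common base of a 3-nt block; first base if all three differ (assumption)."""
    base, count = Counter(block).most_common(1)[0]
    return base if count >= 2 else block[0]


def dedup_by_majority_vote(trimer_umis):
    """Keep the first trimer UMI seen for each distinct collapsed sequence."""
    representatives = {}
    for umi in trimer_umis:
        key = "".join(majority_base(umi[i:i + 3]) for i in range(0, len(umi), 3))
        representatives.setdefault(key, umi)
    return list(representatives.values())


# Toy input: three of the four trimers collapse to the same monomer UMI "AT".
reads = ["AAATTT", "AACTTT", "GGGCCC", "AAATTG"]
kept = dedup_by_majority_vote(reads)
print(len(kept), kept)
# 2 ['AAATTT', 'GGGCCC']
```

The deduplicated count is then simply the length of the returned list.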