3. Set cover
Greedy algorithm
The greedy algorithm for set cover is a heuristic used to approximate solutions to the Set Cover Problem, a combinatorial optimization problem known to be NP-hard. The objective of the Set Cover Problem is to cover a universe of elements with the fewest possible subsets from a given collection.
Given a universe 𝑈 of 𝑛 elements and a collection 𝑆 of 𝑚 subsets of 𝑈, the task is to identify the smallest subcollection of 𝑆 such that every element of 𝑈 is included in at least one subset from this subcollection.
Greedy algorithm
Initialization
Begin with an empty set of chosen subsets.
Maintain a set to track which elements have been covered.
Iterative selection
While uncovered elements remain in 𝑈: for each subset in 𝑆, count how many uncovered elements it includes.
Select the subset that covers the most uncovered elements.
Add this subset to the list of selected subsets.
Update the set of covered elements to reflect the addition of the new subset.
Termination
Continue the process until all elements in 𝑈 are covered. The result is a collection of subsets that provides an approximate solution to the Set Cover Problem.
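The steps above can be sketched as a short, self-contained Python function. The universe and subsets below are illustrative toy data, not drawn from the UMI example later in this page:

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset covering the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Greedy choice: the subset with the largest overlap with the
        # still-uncovered elements.
        best = max(subsets, key=lambda s: len(s & uncovered))
        if not (best & uncovered):
            break  # remaining elements cannot be covered by any subset
        chosen.append(best)
        uncovered -= best
    return chosen

universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
cover = greedy_set_cover(universe, subsets)
print(cover)  # [{1, 2, 3}, {4, 5}]
```

Here the greedy rule first takes {1, 2, 3} (covering three elements), then {4, 5}, yielding a cover of size two.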
Set cover function
The Python function below implements greedy set cover for UMI deduplication.
Python
def greedy(
        multimer_list,
        recur_len,
        split_method='split_to_all',
):
    # `collapse` provides the UMI-splitting helpers; it is imported elsewhere in umiche.
    if split_method == 'split_to_all':
        split_func = collapse.split_to_all
    else:
        split_func = collapse.split_by_mv
    # Map each multimer UMI to the set of monomer UMIs it can collapse to.
    umi_dict = {multimer_umi: split_func(
        umi=multimer_umi,
        recur_len=recur_len,
    ) for multimer_umi in multimer_list}
    monomer_umi_lens = []
    multimer_umi_lens = []
    merged_mono_umi_dict = {}
    trimer_umi_to_id_map = {trimer_umi: k for k, trimer_umi in enumerate(umi_dict.keys())}
    trimer_id_to_umi_map = {k: trimer_umi for k, trimer_umi in enumerate(umi_dict.keys())}
    # @@ [*umi_dict.keys()]
    # ['GGGTTTGTGACCCCCTGTAAATTTCCCCGGAAAGTG',
    #  'GGGAAATTTTTTGTTCTCAAAGGGCAAGGGAAATTT',
    #  ...,
    #  'AAAGGGAAACCCAAATTTGGGTTTTCGTTTCCTTTT']
    mono_umi_set_list = [*umi_dict.values()]
    # @@ mono_umi_set_list
    # [{'GTGCCTATCGAG', 'GTGACGATCGAT', ..., 'GTGCCGATCGAT'},
    #  {'GATTGCAGAGAT', 'GATTTTAGCGAT', ..., 'GATTGCAGCGAT'},
    #  ...,
    #  {'AGACATGTGTTT', 'AGACATGTCTTT', ..., 'AGACATGTTTCT'}]
    mono_umi_set_list_remaining = umi_dict
    num_steps = 0
    is_empty_set_overlap = False
    while not is_empty_set_overlap:
        # Count how many trimer UMIs each monomer UMI can account for.
        mono_umi_to_trimer_id_dict = {}
        for multimer_umi, mono_umi_set in mono_umi_set_list_remaining.items():
            for mono_umi in mono_umi_set:
                if mono_umi in mono_umi_to_trimer_id_dict:
                    mono_umi_to_trimer_id_dict[mono_umi].append(trimer_umi_to_id_map[multimer_umi])
                else:
                    mono_umi_to_trimer_id_dict[mono_umi] = [trimer_umi_to_id_map[multimer_umi]]
        # @@ mono_umi_to_trimer_id_dict
        # {'GGATTCGGGACT': [5022, 6458], ..., 'TAAAAAGATTAT': [6890], 'TAAAATGACTAT': [6890]}
        monomer_umi_lens.append(len(mono_umi_to_trimer_id_dict))
        monomer_umi_to_cnt_map = {k: len(v) for k, v in mono_umi_to_trimer_id_dict.items()}
        # @@ monomer_umi_to_cnt_map
        # {'GGATTCGGGACT': 2, ..., 'TAAAAAGACTAT': 1, 'TAAAATGATTAT': 1}
        if monomer_umi_to_cnt_map:
            # Greedy choice: the monomer UMI covering the most remaining trimer UMIs.
            monomer_umi_max = max(monomer_umi_to_cnt_map, key=monomer_umi_to_cnt_map.get)
        else:
            break
        print(monomer_umi_max)
        # TTAGATGATTAT
        # ...
        # TTTTAAGCTGTC
        # TCCTCTAGTGCC
        if monomer_umi_to_cnt_map[monomer_umi_max] > 1:
            multimer_umi_ids = mono_umi_to_trimer_id_dict[monomer_umi_max]
            multimer_umi_lens.append(len(multimer_umi_ids) - 1)
            # Important: keep one representative trimer UMI for the covered group.
            merged_mono_umi_dict[monomer_umi_max] = trimer_id_to_umi_map[mono_umi_to_trimer_id_dict[monomer_umi_max][0]]
            # Remove all trimer UMIs covered by the chosen monomer UMI.
            for multimer_umi_id in multimer_umi_ids:
                mono_umi_set_list_remaining.pop(trimer_id_to_umi_map[multimer_umi_id], None)
            num_steps += 1
            is_empty_set_overlap = False
        else:
            is_empty_set_overlap = True
    multimer_umi_solved_by_sc = [*merged_mono_umi_dict.values()]
    multimer_umi_not_solved = [*mono_umi_set_list_remaining.keys()]
    shortlisted_multimer_umi_list = multimer_umi_solved_by_sc + multimer_umi_not_solved
    print('=========># of shortlisted multimer UMIs solved by set cover: {}'.format(len(multimer_umi_solved_by_sc)))
    print('=========># of shortlisted multimer UMIs not solved by set cover: {}'.format(len(multimer_umi_not_solved)))
    print('=========># of shortlisted multimer UMIs: {}'.format(len(shortlisted_multimer_umi_list)))
    dedup_cnt = len(mono_umi_set_list) - sum(multimer_umi_lens)
    print('=========>dedup cnt: {}'.format(dedup_cnt))
    return dedup_cnt, multimer_umi_solved_by_sc, multimer_umi_not_solved, shortlisted_multimer_umi_list, monomer_umi_lens, multimer_umi_lens
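The splitting helpers (`collapse.split_to_all`, `collapse.split_by_mv`) are not shown above. Judging from the example outputs in the comments, `split_to_all` appears to enumerate every monomer UMI obtainable by taking one base from each trimer block. The sketch below is a hypothetical re-implementation under that assumption, not umiche's actual code:

```python
from itertools import product

def split_to_all(umi, recur_len=3):
    # Hypothetical sketch: cut the multimer UMI into blocks of length
    # recur_len, then enumerate every monomer UMI formed by choosing
    # one distinct base per block.
    blocks = [umi[i:i + recur_len] for i in range(0, len(umi), recur_len)]
    choices = [sorted(set(block)) for block in blocks]
    return {''.join(p) for p in product(*choices)}

# A clean homotrimer block contributes a single base; a block carrying a
# sequencing error contributes several candidates.
print(split_to_all('GGGTTTGTG'))  # {'GTG', 'GTT'}
```

Under this reading, two trimer UMIs that share any monomer candidate can be merged by the set cover step above.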
UMI deduplication
We use a BAM file containing reads tagged with trimer UMIs; it holds a total of 6949 reads observed at a single locus. To read it, we do
Python
import umiche as uc

bam = uc.io.read_bam(
    bam_fpn="/mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam",
    verbose=True,
)
df_bam = bam.todf(tags=['PO'])
print(df_bam)
console
30/07/2024 20:28:17 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam
30/07/2024 20:28:17 logger: ===>reading BAM time: 0.00s
30/07/2024 20:28:17 logger: =========>start converting bam to df...
30/07/2024 20:28:17 logger: =========>time to df: 0.030s
        id  ... PO
0        0  ...  1
1        1  ...  1
2        2  ...  1
3        3  ...  1
4        4  ...  1
...    ...  ... ..
6944  6944  ...  1
6945  6945  ...  1
6946  6946  ...  1
6947  6947  ...  1
6948  6948  ...  1

[6949 rows x 13 columns]
Then, we extract only the trimer UMI sequences from the reads.
Python
trimer_list = df_bam.query_name.apply(lambda x: x.split('_')[1]).values
print(trimer_list)
console
['GGGTTTGTGACCCCCTGTAAATTTCCCCGGAAAGTG'
 'GGGAAATTTTTTGTTCTCAAAGGGCAAGGGAAATTT'
 'TTTGGGAACAAAGGGTTTAGGTTTCGGAAAAAATTT' ...
 'GGGAAAAAAGGGAACAGATATAAATTTTTTTTTCCC'
 'TTTATTAAAGGAAAATTAGGGAAACTTTTTAAATTT'
 'AAAGGGAAACCCAAATTTGGGTTTTCGTTTCCTTTT']
Using the trimers as input, we can perform deduplication with set cover.
Python
import umiche as uc

(
    dedup_cnt,
    multimer_umi_solved_by_sc,
    multimer_umi_not_solved,
    shortlisted_multimer_umi_list,
    monomer_umi_lens,
    multimer_umi_lens,
) = uc.dedup.set_cover(
    multimer_list=trimer_list,
    recur_len=3,
    split_method='split_to_all',
)
print(dedup_cnt)
console
30/07/2024 20:40:55 logger: =========># of shortlisted multimer UMIs solved by set cover: 71
30/07/2024 20:40:55 logger: =========># of shortlisted multimer UMIs not solved by set cover: 111
30/07/2024 20:40:55 logger: =========># of shortlisted multimer UMIs: 182
30/07/2024 20:40:55 logger: =========>dedup cnt: 182
Deduplicated UMI count
The result shows that the 6949 reads are deduplicated to 182 unique molecules.
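As a quick consistency check on the logged counts, the UMIs merged by set cover plus those it could not merge further account for the final deduplicated count:

```python
solved_by_sc = 71   # multimer UMIs resolved (merged) by set cover
not_solved = 111    # multimer UMIs left unresolved by set cover
dedup_cnt = solved_by_sc + not_solved
print(dedup_cnt)  # 182, matching the reported dedup count
```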