3. Set cover
Greedy algorithm
The greedy algorithm for set cover is a heuristic used to approximate solutions to the Set Cover Problem, a combinatorial optimization problem known to be NP-hard. The objective of the Set Cover Problem is to cover a universe of elements with the fewest possible subsets from a given collection.
Given a universe 𝑈 of 𝑛 elements and a collection 𝑆 of 𝑚 subsets of 𝑈, the task is to identify the smallest subcollection of 𝑆 such that every element of 𝑈 is included in at least one subset from this subcollection.
Greedy algorithm
Initialization
Begin with an empty set of chosen subsets.
Maintain a set to track which elements have been covered.
Iterative selection
While uncovered elements remain in 𝑈: for each subset in 𝑆, count how many uncovered elements it includes.
Select the subset that covers the most uncovered elements.
Add this subset to the list of selected subsets.
Update the set of covered elements to reflect the addition of the new subset.
Termination
Continue the process until all elements in 𝑈 are covered. The result is a collection of subsets that provides an approximate solution to the Set Cover Problem.
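The steps above can be sketched as a short, self-contained Python function. The universe and subsets below are illustrative toy data, not drawn from the UMI example later in this page:

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset covering the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Greedy choice: the subset with the largest overlap with the
        # still-uncovered elements.
        best = max(subsets, key=lambda s: len(s & uncovered))
        if not (best & uncovered):
            break  # remaining elements cannot be covered by any subset
        chosen.append(best)
        uncovered -= best
    return chosen

universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
cover = greedy_set_cover(universe, subsets)
print(cover)  # [{1, 2, 3}, {4, 5}]
```

Here the greedy rule first takes {1, 2, 3} (covering three elements), then {4, 5}, yielding a cover of size two.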
Set cover function
The Python function below implements greedy set cover for UMI deduplication.
Python
def greedy(
        multimer_list,
        recur_len,
        split_method='split_to_all',
):
    # `collapse` provides the UMI-splitting helpers; it is imported elsewhere in umiche.
    if split_method == 'split_to_all':
        split_func = collapse.split_to_all
    else:
        split_func = collapse.split_by_mv
    # Map each multimer UMI to the set of monomer UMIs it can collapse to.
    umi_dict = {multimer_umi: split_func(
        umi=multimer_umi,
        recur_len=recur_len,
    ) for multimer_umi in multimer_list}
    monomer_umi_lens = []
    multimer_umi_lens = []
    merged_mono_umi_dict = {}
    trimer_umi_to_id_map = {trimer_umi: k for k, trimer_umi in enumerate(umi_dict.keys())}
    trimer_id_to_umi_map = {k: trimer_umi for k, trimer_umi in enumerate(umi_dict.keys())}
    # @@ [*umi_dict.keys()]
    # ['GGGTTTGTGACCCCCTGTAAATTTCCCCGGAAAGTG',
    #  'GGGAAATTTTTTGTTCTCAAAGGGCAAGGGAAATTT',
    #  ...,
    #  'AAAGGGAAACCCAAATTTGGGTTTTCGTTTCCTTTT']
    mono_umi_set_list = [*umi_dict.values()]
    # @@ mono_umi_set_list
    # [{'GTGCCTATCGAG', 'GTGACGATCGAT', ..., 'GTGCCGATCGAT'},
    #  {'GATTGCAGAGAT', 'GATTTTAGCGAT', ..., 'GATTGCAGCGAT'},
    #  ...,
    #  {'AGACATGTGTTT', 'AGACATGTCTTT', ..., 'AGACATGTTTCT'}]
    mono_umi_set_list_remaining = umi_dict
    num_steps = 0
    is_empty_set_overlap = False
    while not is_empty_set_overlap:
        # Count how many trimer UMIs each monomer UMI can account for.
        mono_umi_to_trimer_id_dict = {}
        for multimer_umi, mono_umi_set in mono_umi_set_list_remaining.items():
            for mono_umi in mono_umi_set:
                if mono_umi in mono_umi_to_trimer_id_dict:
                    mono_umi_to_trimer_id_dict[mono_umi].append(trimer_umi_to_id_map[multimer_umi])
                else:
                    mono_umi_to_trimer_id_dict[mono_umi] = [trimer_umi_to_id_map[multimer_umi]]
        # @@ mono_umi_to_trimer_id_dict
        # {'GGATTCGGGACT': [5022, 6458], ..., 'TAAAAAGATTAT': [6890], 'TAAAATGACTAT': [6890]}
        monomer_umi_lens.append(len(mono_umi_to_trimer_id_dict))
        monomer_umi_to_cnt_map = {k: len(v) for k, v in mono_umi_to_trimer_id_dict.items()}
        # @@ monomer_umi_to_cnt_map
        # {'GGATTCGGGACT': 2, ..., 'TAAAAAGACTAT': 1, 'TAAAATGATTAT': 1}
        if monomer_umi_to_cnt_map:
            # Greedy choice: the monomer UMI covering the most remaining trimer UMIs.
            monomer_umi_max = max(monomer_umi_to_cnt_map, key=monomer_umi_to_cnt_map.get)
        else:
            break
        print(monomer_umi_max)
        # TTAGATGATTAT
        # ...
        # TTTTAAGCTGTC
        # TCCTCTAGTGCC
        if monomer_umi_to_cnt_map[monomer_umi_max] > 1:
            multimer_umi_ids = mono_umi_to_trimer_id_dict[monomer_umi_max]
            multimer_umi_lens.append(len(multimer_umi_ids) - 1)
            # Important: keep one representative trimer UMI for the covered group.
            merged_mono_umi_dict[monomer_umi_max] = trimer_id_to_umi_map[mono_umi_to_trimer_id_dict[monomer_umi_max][0]]
            # Remove all trimer UMIs covered by the chosen monomer UMI.
            for multimer_umi_id in multimer_umi_ids:
                mono_umi_set_list_remaining.pop(trimer_id_to_umi_map[multimer_umi_id], None)
            num_steps += 1
            is_empty_set_overlap = False
        else:
            is_empty_set_overlap = True
    multimer_umi_solved_by_sc = [*merged_mono_umi_dict.values()]
    multimer_umi_not_solved = [*mono_umi_set_list_remaining.keys()]
    shortlisted_multimer_umi_list = multimer_umi_solved_by_sc + multimer_umi_not_solved
    print('=========># of shortlisted multimer UMIs solved by set cover: {}'.format(len(multimer_umi_solved_by_sc)))
    print('=========># of shortlisted multimer UMIs not solved by set cover: {}'.format(len(multimer_umi_not_solved)))
    print('=========># of shortlisted multimer UMIs: {}'.format(len(shortlisted_multimer_umi_list)))
    dedup_cnt = len(mono_umi_set_list) - sum(multimer_umi_lens)
    print('=========>dedup cnt: {}'.format(dedup_cnt))
    return dedup_cnt, multimer_umi_solved_by_sc, multimer_umi_not_solved, shortlisted_multimer_umi_list, monomer_umi_lens, multimer_umi_lens
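The splitting helpers (`collapse.split_to_all`, `collapse.split_by_mv`) are not shown above. Judging from the example outputs in the comments, `split_to_all` appears to enumerate every monomer UMI obtainable by taking one base from each trimer block. The sketch below is a hypothetical re-implementation under that assumption, not umiche's actual code:

```python
from itertools import product

def split_to_all(umi, recur_len=3):
    # Hypothetical sketch: cut the multimer UMI into blocks of length
    # recur_len, then enumerate every monomer UMI formed by choosing
    # one distinct base per block.
    blocks = [umi[i:i + recur_len] for i in range(0, len(umi), recur_len)]
    choices = [sorted(set(block)) for block in blocks]
    return {''.join(p) for p in product(*choices)}

# A clean homotrimer block contributes a single base; a block carrying a
# sequencing error contributes several candidates.
print(split_to_all('GGGTTTGTG'))  # {'GTG', 'GTT'}
```

Under this reading, two trimer UMIs that share any monomer candidate can be merged by the set cover step above.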
UMI deduplication
We use a BAM file containing reads tagged with trimer UMIs; it holds a total of 6949 reads observed at a single locus. To read it, we do
Python
import umiche as uc

bam = uc.io.read_bam(
    bam_fpn="/mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam",
    verbose=True,
)
df_bam = bam.todf(tags=['PO'])
print(df_bam)
console
30/07/2024 20:28:17 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/umiche/umiche/data/simu/umi/trimer/seq_errs/permute_0/trimmed/seq_err_17.bam
30/07/2024 20:28:17 logger: ===>reading BAM time: 0.00s
30/07/2024 20:28:17 logger: =========>start converting bam to df...
30/07/2024 20:28:17 logger: =========>time to df: 0.030s
        id  ... PO
0        0  ...  1
1        1  ...  1
2        2  ...  1
3        3  ...  1
4        4  ...  1
...    ...  ... ..
6944  6944  ...  1
6945  6945  ...  1
6946  6946  ...  1
6947  6947  ...  1
6948  6948  ...  1

[6949 rows x 13 columns]
Then, we extract only the trimer UMI sequences from the reads.
Python
trimer_list = df_bam.query_name.apply(lambda x: x.split('_')[1]).values
print(trimer_list)
console
['GGGTTTGTGACCCCCTGTAAATTTCCCCGGAAAGTG'
 'GGGAAATTTTTTGTTCTCAAAGGGCAAGGGAAATTT'
 'TTTGGGAACAAAGGGTTTAGGTTTCGGAAAAAATTT' ...
 'GGGAAAAAAGGGAACAGATATAAATTTTTTTTTCCC'
 'TTTATTAAAGGAAAATTAGGGAAACTTTTTAAATTT'
 'AAAGGGAAACCCAAATTTGGGTTTTCGTTTCCTTTT']
Using the trimers as input, we can perform deduplication with set cover.
Python
import umiche as uc

(
    dedup_cnt,
    multimer_umi_solved_by_sc,
    multimer_umi_not_solved,
    shortlisted_multimer_umi_list,
    monomer_umi_lens,
    multimer_umi_lens,
) = uc.dedup.set_cover(
    multimer_list=trimer_list,
    recur_len=3,
    split_method='split_to_all',
)
print(dedup_cnt)
console
30/07/2024 20:40:55 logger: =========># of shortlisted multimer UMIs solved by set cover: 71
30/07/2024 20:40:55 logger: =========># of shortlisted multimer UMIs not solved by set cover: 111
30/07/2024 20:40:55 logger: =========># of shortlisted multimer UMIs: 182
30/07/2024 20:40:55 logger: =========>dedup cnt: 182
Deduplicated UMI count
The result shows that the 6949 reads are deduplicated to 182 unique molecules.
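As a quick consistency check on the logged counts, the UMIs merged by set cover plus those it could not merge further account for the final deduplicated count:

```python
solved_by_sc = 71   # multimer UMIs resolved (merged) by set cover
not_solved = 111    # multimer UMIs left unresolved by set cover
dedup_cnt = solved_by_sc + not_solved
print(dedup_cnt)  # 182, matching the reported dedup count
```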