Quick start

This quick start walks through an example of UMI deduplication with mclUMI. Each deduplication module in mclUMI provides seven methods (unique, cluster, adjacency, directional, mcl, mcl_ed, and mcl_val) for precise counting of unique UMIs in four application scenarios: 1) a single genomic locus, 2) multiple genomic loci, 3) genes, and 4) cell-by-gene types.

We present a case study of UMI deduplication by genomic position. In mclUMI, mclumi.multipos handles this scenario: it lets users remove PCR artifacts by deduplicating UMIs over a set of genomic position annotations. In this quick start we omit the data preprocessing steps and start directly from a dataset (a clip of the ChIP-seq data also used in UMI-tools) containing 1,175,027 reads with 20,683 raw unique UMI sequences at 12,047 genomic positions. The positions were obtained by running the UMI-tools get_bundles method, which mclUMI also adopts. For details, please refer to the mclumi.prep.run module.

Install

pip install mclumi --upgrade

Running

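import mclumi as mu

# NOTE: `to` is not defined in this snippet. It is assumed to be a helper
# that resolves a path; if it is unavailable, pass plain path strings instead.
df_mcl = mu.multipos.mcl(
    bam_fpn=to('data/example_bundle.bam'),  # input BAM file
    ed_thres=1,           # edit distance threshold
    pos_tag='PO',         # BAM tag carrying the genomic position annotation
    work_dir=to('data/'),  # directory for output files
    verbose=False,        # True or False
    heterogeneity=False,  # True or False
    inflat_val=1.6,       # Markov clustering: inflation value
    exp_val=2,            # Markov clustering: expansion value
    iter_num=100,         # Markov clustering: number of iterations
)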

print(df_mcl)
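
To give a feel for what the inflat_val, exp_val, and iter_num arguments control, here is a minimal pure-Python sketch of generic Markov clustering on a small graph. It is not mclUMI's implementation: the function names, the adjacency-matrix input, the self-loops, and the convergence tolerance are assumptions made for illustration. Only the three parameter names are borrowed from the call above.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adj, inflat_val=1.6, exp_val=2, iter_num=100, tol=1e-9):
    """Illustrative Markov clustering on a 0/1 adjacency matrix.

    Nodes would be UMIs and edges would link UMIs within the edit
    distance threshold; that mapping is an assumption of this sketch.
    """
    n = len(adj)
    # Add self-loops, then turn the graph into a column-stochastic matrix.
    m = normalize_columns(
        [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(iter_num):
        # Expansion: raise the matrix to the power exp_val (random-walk spreading).
        expanded = m
        for _ in range(exp_val - 1):
            expanded = matmul(expanded, m)
        # Inflation: element-wise power, then re-normalize (strengthens strong flows).
        nxt = normalize_columns([[v ** inflat_val for v in row] for row in expanded])
        converged = all(abs(nxt[i][j] - m[i][j]) < tol
                        for i in range(n) for j in range(n))
        m = nxt
        if converged:
            break
    # Each distinct set of nonzero entries in a row is read off as one cluster.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(c) for c in clusters]


# Two disconnected groups of nodes: {0, 1, 2} and {3, 4}.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = markov_cluster(adj)
print(clusters)
```

In this reading, each resulting cluster would count as one deduplicated UMI. A larger inflation value generally yields finer clusters, which is why the parameter is worth exposing.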

Result

After the run, mclUMI writes two files of UMI deduplication statistics and one BAM file of deduplicated reads. For details, please see the documentation for the four UMI deduplication levels.

Fig 1. Average edit distance observed at multiple genomic positions
Fig 2. Statistics of UMI deduplication using mclUMI
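
As a rough illustration of the statistic in Fig 1, the sketch below computes an average edit distance for the UMIs observed at one genomic position. It assumes the edit distance is the Hamming distance between equal-length UMIs and that the average is taken over all pairs of distinct UMIs. Both are assumptions, and the function names are hypothetical. mclUMI may define the statistic differently.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def average_edit_distance(umis):
    """Mean pairwise Hamming distance over the distinct UMIs at one position."""
    pairs = list(combinations(sorted(set(umis)), 2))
    if not pairs:
        return 0.0  # a single UMI has no pair to compare against
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Hypothetical UMIs at a single genomic position.
avg = average_edit_distance(["AAAA", "AAAT", "TTTT", "AAAA"])
print(avg)
```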

Comparison of the directional method between mclUMI and UMI-tools

The directional algorithm in the UMI-tools suite has been reported to perform best at identifying PCR duplicates. In mclUMI, we re-implemented the directional method to familiarize ourselves with UMI deduplication and to make sure our further optimization work starts from a correct baseline. We then proposed a more flexible method based on Markov clustering. What does the output of the directional module look like in UMI-tools (see also the results) and in mclUMI? The same! Please see below.

Fig 3. Average edit distance distribution of genomic positions
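
For readers unfamiliar with the directional method, here is a minimal sketch of the rule as described for UMI-tools: a directed edge goes from UMI A to UMI B when the two are within the edit distance threshold and count(A) >= 2 * count(B) - 1, and each network reachable from the most abundant unvisited UMI is counted as one original molecule. The function names and the example counts are hypothetical. This is an illustration, not the code of either tool.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def directional_clusters(umi_counts, ed_thres=1):
    """Group the UMIs at one position using the directional rule."""
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    # Directed edge u -> v: close in sequence and u is abundant enough
    # for v to be explained as a PCR or sequencing error of u.
    edges = {
        u: [v for v in umis
            if v != u
            and hamming(u, v) <= ed_thres
            and umi_counts[u] >= 2 * umi_counts[v] - 1]
        for u in umis
    }
    seen, clusters = set(), []
    for u in umis:
        if u in seen:
            continue
        seen.add(u)
        cluster, queue = [u], deque([u])
        # Breadth-first search along directed edges from the seed UMI.
        while queue:
            cur = queue.popleft()
            for nb in edges[cur]:
                if nb not in seen:
                    seen.add(nb)
                    cluster.append(nb)
                    queue.append(nb)
        clusters.append(cluster)
    return clusters


# Hypothetical UMI counts at one genomic position.
counts = {"ACGT": 100, "ACGA": 3, "ACGC": 60, "TTTT": 5}
result = directional_clusters(counts, ed_thres=1)
print(result)       # clusters of UMIs
print(len(result))  # deduplicated UMI count at this position
```

Here ACGA is absorbed by ACGT, whereas ACGC stays separate because its count is too high to be explained as an error copy of ACGT.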