Quick start

This quick start walks through an example of using mclUMI for UMI deduplication. Each deduplication module in mclUMI provides seven methods (unique, cluster, adjacency, directional, mcl, mcl_ed, and mcl_val) for precise counting of unique UMIs in four application scenarios: 1) a single genomic locus, 2) multiple genomic loci, 3) genes, and 4) cell-by-gene types.

We present a case study of UMI deduplication by genomic position. In mclUMI, the mclumi.multipos module handles this task: it lets users remove PCR artifacts by deduplicating UMIs across a set of annotated genomic positions. This guide omits data preprocessing and starts directly from a prepared dataset (a clip of ChIP-seq data also used in UMI-tools) containing 1,175,027 reads with 20,683 raw unique UMI sequences and 12,047 genomic positions, obtained by running the UMI-tools get_bundles method. mclUMI adopts the same method; for details, please refer to the mclumi.prep.run module.

Install

pip install mclumi --upgrade

Running

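import mclumi as mu

# NOTE: `to` is not defined in this snippet. It is presumably a path helper
# that resolves paths relative to the mclUMI package; a plain path string
# such as 'data/example_bundle.bam' should work as well.
df_mcl = mu.multipos.mcl(
    bam_fpn=to('data/example_bundle.bam'),  # input BAM file
    ed_thres=1,                 # edit distance threshold
    pos_tag='PO',               # BAM tag holding the genomic position annotation
    work_dir=to('data/'),       # working directory for output files
    verbose=False,              # True or False
    heterogeneity=False,        # True or False
    inflat_val=1.6,             # Markov clustering inflation value
    exp_val=2,                  # Markov clustering expansion value
    iter_num=100,               # number of Markov clustering iterations
)

print(df_mcl)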
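To give a feel for what the inflat_val, exp_val, and iter_num parameters control, here is a minimal, pure-Python sketch of generic Markov clustering on a small UMI graph. It is an illustration under stated assumptions, not mclUMI's implementation: the function names (`normalize_columns`, `matmul`, `markov_cluster`), the `tol` convergence tolerance, and the toy adjacency matrix are all hypothetical. Only the three parameter names are borrowed from the call above.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / total if total else 0.0
    return out


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adj, inflat_val=1.6, exp_val=2, iter_num=100, tol=1e-9):
    """Generic Markov clustering; returns clusters as tuples of node indices."""
    n = len(adj)
    # Add self-loops, then turn the adjacency matrix into transition probabilities.
    m = normalize_columns(
        [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(iter_num):
        # Expansion: raise the matrix to the power exp_val (random-walk spreading).
        expanded = m
        for _ in range(exp_val - 1):
            expanded = matmul(expanded, m)
        # Inflation: element-wise power, then re-normalize (strengthens strong links).
        inflated = normalize_columns(
            [[v ** inflat_val for v in row] for row in expanded]
        )
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:  # stop early once the matrix no longer changes
            break
    # Each row with non-zero entries lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = tuple(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(clusters)


# Toy graph (hypothetical): UMIs 0, 1, 2 are mutually within the edit distance
# threshold; UMIs 3 and 4 are within it of each other; no edge links the groups.
toy_adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
toy_clusters = markov_cluster(toy_adj)
print(toy_clusters)       # [(0, 1, 2), (3, 4)]
print(len(toy_clusters))  # 2 deduplicated UMIs
```

In this sketch, a larger inflation value generally yields finer-grained clusters, while the number of clusters stands in for the deduplicated UMI count at one genomic position.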

Result

After the run, mclUMI generates two files of UMI deduplication statistics and one BAM file of deduplicated reads. For details, please see the documentation on the four UMI deduplication levels.

Fig 1. Average edit distance observed at multiple genomic positions

Comparison of Directional between mclUMI and UMI-tools

The directional algorithm in the UMI-tools suite has been reported to perform best at identifying PCR duplicates. In mclUMI, we re-implemented the directional method to familiarize ourselves with UMI deduplication and to ensure that our further optimization work builds on a correct foundation. We then proposed a more flexible method based on Markov clustering. How does the output of the directional module in mclUMI compare with that of UMI-tools (see also the results)? It is the same; please see the results below.
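As background for the comparison, the directional rule can be sketched in a few lines of Python. This is a minimal illustration of the rule as described for UMI-tools (a UMI a absorbs a UMI b when they are within the edit distance threshold and count(a) >= 2 * count(b) - 1); it is not the code of either tool. The function names, the use of Hamming distance as the edit distance for equal-length UMIs, and the example counts are assumptions made for this sketch.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_clusters(counts, ed_thres=1):
    """Group UMIs with the directional rule (hypothetical helper).

    A directed edge a -> b is drawn when the two UMIs differ by at most
    ed_thres mismatches and counts[a] >= 2 * counts[b] - 1.
    """
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(counts, key=lambda u: (-counts[u], u))
    children = {u: [] for u in umis}
    for a in umis:
        for b in umis:
            if (a != b and hamming(a, b) <= ed_thres
                    and counts[a] >= 2 * counts[b] - 1):
                children[a].append(b)

    seen = set()
    clusters = []
    for root in umis:
        if root in seen:
            continue
        # Breadth-first walk along directed edges starting from this root.
        seen.add(root)
        cluster = [root]
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nxt in children[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    cluster.append(nxt)
                    queue.append(nxt)
        clusters.append(cluster)
    return clusters


# Hypothetical UMI counts observed at one genomic position.
example_counts = {"ACGT": 100, "ACGA": 10, "TCGA": 2, "GGGG": 50, "GGGA": 40}
example_clusters = directional_clusters(example_counts)
print(example_clusters)
# [['ACGT', 'ACGA', 'TCGA'], ['GGGG'], ['GGGA']]
```

Here GGGG and GGGA stay separate even though they differ by one base, because neither count is large enough relative to the other (50 < 2 * 40 - 1). The deduplicated UMI count for this position is therefore 3.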