Quick start
This quick start walks through an example of using mclUMI for UMI deduplication. Every mclUMI module for this purpose provides 7 methods, namely unique, cluster, adjacency, directional, mcl, mcl_ed, and mcl_val, to count unique UMIs precisely in four application scenarios: (1) a single genomic locus, (2) multiple genomic loci, (3) genes, and (4) cell-by-gene types.
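To make the two simplest method names concrete, here is a minimal sketch of what unique and cluster usually mean in the UMI-tools literature. It is not mclUMI's implementation: the function names, the Hamming-distance threshold of 1, and the example UMIs are all assumptions made for illustration.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def count_unique(umis):
    """'unique' (assumed meaning): every distinct UMI is one molecule."""
    return len(set(umis))


def count_cluster(umis, threshold=1):
    """'cluster' (assumed meaning): each connected component of the graph
    linking UMIs within `threshold` mismatches is one molecule."""
    nodes = list(Counter(umis))
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        # Depth-first walk over all UMIs reachable from `start`.
        while stack:
            current = stack.pop()
            for other in nodes:
                if other not in seen and hamming(current, other) <= threshold:
                    seen.add(other)
                    stack.append(other)
    return components


# Hypothetical UMIs observed at a single genomic position.
reads = ["AAAA", "AAAA", "AAAT", "AATT", "CCCC", "CCCG", "GGGG"]
print(count_unique(reads))   # 6 distinct sequences
print(count_cluster(reads))  # 3 connected components
```

In this toy input, AAAA, AAAT, and AATT chain together through single mismatches, so the cluster count drops from 6 to 3.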
We present a case study of UMI deduplication by genomic position. In mclUMI, mclumi.multipos handles this task: it lets users deduplicate PCR artifacts/UMIs based on a set of genomic position annotations. This quick start omits the data preprocessing steps and works directly with a dataset (a clip of the ChIP-seq data also used in UMI-tools) containing 1,175,027 reads with 20,683 raw unique UMI sequences and 12,047 genomic positions, obtained by running the UMI-tools get_bundles method. mclUMI also adopts this method. For details, please refer to the mclumi.prep.run module.
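The idea of deduplicating by genomic position can be sketched as grouping reads into per-position bundles of UMI counts before any method is applied. The snippet below is only an illustration under stated assumptions: the (position, UMI) tuples are made up, the helper name bundle_by_position is hypothetical, and it does not reproduce the UMI-tools get_bundles method or the mclumi.prep.run module.

```python
from collections import defaultdict

# Hypothetical reads as (genomic_position, UMI) pairs. Real input would come
# from a BAM file, which this sketch does not parse.
reads = [
    (100, "ACGT"), (100, "ACGT"), (100, "ACGA"),
    (250, "TTGC"), (250, "TTGC"),
    (400, "GGAT"),
]


def bundle_by_position(reads):
    """Group reads by position and count how often each UMI occurs there."""
    bundles = defaultdict(lambda: defaultdict(int))
    for position, umi in reads:
        bundles[position][umi] += 1
    # Convert nested defaultdicts to plain dicts for predictable behaviour.
    return {position: dict(umis) for position, umis in bundles.items()}


bundles = bundle_by_position(reads)

# Raw unique UMI count per position, before any error correction.
raw_unique_per_position = {pos: len(umis) for pos, umis in bundles.items()}

for pos in sorted(bundles):
    print(pos, bundles[pos], raw_unique_per_position[pos])
```

Each per-position dictionary of UMI counts is the kind of input a deduplication method would then collapse into a final molecule count.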
Install
pip install mclumi --upgrade
Running
Result
After running, mclUMI generates two files of UMI deduplication statistics and one BAM file of deduplicated reads. For details, please see the 4 different UMI deduplication levels.


Comparison of directional between mclUMI and UMI-tools
The directional algorithm in the UMI-tools suite has been reported to perform best at identifying PCR duplicates. In mclUMI, we re-implemented the directional method to familiarize ourselves with UMI deduplication and to make sure our further optimization work starts from a correct baseline. We then proposed a more flexible method based on Markov clustering. What does the output of the directional module look like in UMI-tools (see also the results) and in mclUMI? The same! Please see below.
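For readers unfamiliar with the directional method, the sketch below follows its published description: a more abundant UMI absorbs a neighbour within one mismatch when count(parent) >= 2 * count(child) - 1. This is a simplified illustration, not the code of UMI-tools or mclUMI; the function names, the threshold default, and the example counts are assumptions.

```python
from collections import deque


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def directional_groups(umi_counts, threshold=1):
    """Group UMIs by following directed edges from abundant to rare UMIs."""
    # Visit UMIs from most to least abundant; ties are broken alphabetically
    # so that the result is deterministic.
    order = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = set()
    groups = []
    for seed in order:
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        queue = deque([seed])
        while queue:
            parent = queue.popleft()
            for child in order:
                if child in assigned:
                    continue
                close_enough = hamming(parent, child) <= threshold
                abundant_enough = umi_counts[parent] >= 2 * umi_counts[child] - 1
                if close_enough and abundant_enough:
                    assigned.add(child)
                    group.append(child)
                    queue.append(child)
        groups.append(group)
    return groups


# Hypothetical UMI counts at one genomic position.
counts = {"ACGT": 100, "ACGA": 3, "ACGC": 60, "TTTT": 5}
groups = directional_groups(counts)
print(groups)       # [['ACGT', 'ACGA'], ['ACGC'], ['TTTT']]
print(len(groups))  # 3 deduplicated molecules
```

Here ACGC is one mismatch away from ACGT but is too abundant to be explained as a PCR error of it (100 < 2 * 60 - 1), so it stays a separate molecule. Grouping purely by connectivity would have merged it with ACGT.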
