2. Multiple loci
mclumi.multipos
is a module used for deduplicating UMIs observed at multiple genomic positions/loci or multiple genes.
Usage¶
The mclumi.multipos
module can be used in Python and Shell.
1. Python¶
1.1 Command¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
1 2 3 4 5 6 7 8 9 10 11 12 |
|
1 2 3 4 5 6 7 8 9 10 11 12 |
|
1 2 3 4 5 6 7 8 9 10 11 12 |
|
1 2 3 4 5 6 7 8 9 10 11 12 |
|
1.2 Attributes in the YAML file¶
Illustration
Attribute | Description |
---|---|
ed_thres |
edit distance |
pos_tag |
the tag to denote from which genomic position/locus a read is observed. For example, mclumi.prep.umi_tools function outputs reads whose positions are tagged with PO |
bam_fpn |
input file in the BAM format |
heterogeneity |
mode to indicate if studying trajectories of UMI origins and copies/derivatives across a series of PCR amplication cycles. If heterogeneity is set to True , users will gain more statistics and files in the final output directory |
inflat_val |
inflation parameter of the Markov Clustering algorithm |
exp_val |
expansion parameter of the Markov Clustering algorithm |
iter_num |
number of interations for the Markov Clustering algorithm to converge |
mcl_fold_thres |
a fold change threshold for MCL at a range of (1, l ) where l is the length of a UMI. This parameter is placed when the mcl_ed or mcl_val function is used. |
verbose |
whether to print intermediate results |
2. Shell¶
2.1 Command¶
The shell command is simplified, but more parameters should be specified in a YAML file, params.yml
in order to reduce complexity but optimise future extension.
1 2 3 4 5 6 7 |
|
2.2 Attributes in the command¶
Illustration
Attribute | Description |
---|---|
-m |
UMI deduplication method |
-ed |
edit distance |
-pfpn |
YAML parameter file. Please see the YAML tab for parameter details |
-bfpn |
input file in the BAM format |
-wd |
working directory |
-vb |
whether to print intermediate results |
2.3 YAML file configuration¶
work_dir: /mnt/d/Document/Programming/Python/mclumi/mclumi/data/
# bam_fpn: /mnt/d/Document/Programming/Python/mclumi/mclumi/data/example_bundle.bam
tag:
# loci
pos_tag: PO
umi:
ed_thres: 1
dedup:
# mcl_val
mcl_fold_thres: 1.
inflat_val: 1.1
exp_val: 2
iter_num: 100
# mcl_ed
mcl_fold_thres: 1.6
inflat_val: 1.1
exp_val: 2
iter_num: 100
2.4 Attributes in the YAML file¶
Illustration
Attribute | Description |
---|---|
work_dir |
working directory |
ed_thres |
edit distance |
pos_tag |
the tag to denote from which genomic position/locus a read is observed. For example, mclumi.prep.umi_tools function outputs reads whose positions are tagged with PO |
inflat_val |
inflation parameter of the Markov Clustering algorithm |
exp_val |
expansion parameter of the Markov Clustering algorithm |
iter_num |
number of interations for the Markov Clustering algorithm to converge |
mcl_fold_thres |
a fold change threshold for MCL at a range of (1, l ) where l is the length of a UMI. This parameter is placed when the mcl_ed or mcl_val function is used. |
Output¶
1. Console¶
____ ____ _______
____ ___ _____/ / / / / |/ / _/
/ __ `__ \/ ___/ / / / / /|_/ // /
/ / / / / / /__/ / /_/ / / / // /
/_/ /_/ /_/\___/_/\____/_/ /_/___/
26/07/2024 16:47:09 logger: ===>You are using method MCL to deduplicate UMIs observed at multiple genomic loci.
26/07/2024 16:47:09 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/mclumi/mclumi/data/example_bundle.bam
26/07/2024 16:47:09 logger: ===>reading BAM time: 0.00s
26/07/2024 16:47:09 logger: =========>start converting bam to df...
26/07/2024 16:47:17 logger: =========>time to df: 7.347s
26/07/2024 16:47:17 logger: ======># of raw reads: 1175027
26/07/2024 16:47:17 logger: ======># of reads with qualified chrs: 1175027
26/07/2024 16:47:17 logger: ======># of unique umis: 1949
26/07/2024 16:47:17 logger: ======># of redundant umis: 1175027
26/07/2024 16:47:17 logger: ======>edit distance thres: 1
26/07/2024 16:47:17 logger: ======># of columns in the bam df: 14
26/07/2024 16:47:17 logger: ======>Columns in the bam df: ['id', 'query_name', 'flag', 'reference_id', 'genome_pos', 'mapping_quality', 'cigar', 'query_sequence', 'next_reference_id', 'next_reference_start', 'query_qualities', 'read', 'PO', 'umi']
26/07/2024 16:47:17 logger: ======># of raw reads:
id query_name ... PO umi
0 0 SRR2057595.11715337_ACCGGTTTA ... 0 ACCGGTTTA
1 1 SRR2057595.11715337_ACCGGTTTA ... 0 ACCGGTTTA
2 2 SRR2057595.11715337_ACCGGTTTA ... 0 ACCGGTTTA
3 3 SRR2057595.11715337_ACCGGTTTA ... 0 ACCGGTTTA
4 4 SRR2057595.11715337_ACCGGTTTA ... 0 ACCGGTTTA
... ... ... ... ... ...
1175022 1175022 SRR2057595.8512200_ACCGGTTGG ... 12046 ACCGGTTGG
1175023 1175023 SRR2057595.8512200_ACCGGTTGG ... 12046 ACCGGTTGG
1175024 1175024 SRR2057595.8512200_ACCGGTTGG ... 12046 ACCGGTTGG
1175025 1175025 SRR2057595.8512200_ACCGGTTGG ... 12046 ACCGGTTGG
1175026 1175026 SRR2057595.8512200_ACCGGTTGG ... 12046 ACCGGTTGG
[1175027 rows x 14 columns]
26/07/2024 16:47:17 logger: ===>start building umi graphs...
26/07/2024 16:47:36 logger: ===>time for building umi graphs: 18.15s
0 {0: [0]}
1 {0: [0]}
2 {0: [0]}
3 {0: [0, 1]}
4 {0: [0]}
...
12042 {0: [0]}
12043 {0: [0]}
12044 {0: [0], 1: [1], 2: [2, 3]}
12045 {0: [0, 1], 1: [2]}
12046 {0: [0], 1: [1]}
Name: mcl, Length: 12047, dtype: object
===>Analysis has been complete and results have been saved!
vignette ... mcl_bam_ids
0 {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [0]
1 {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [74]
2 {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [75]
3 {'graph_adj': {0: [1], 1: [0]}, 'int_to_umi_di... ... [76]
4 {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [351]
... ... ... ...
12042 {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [1174155]
12043 {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [1174641]
12044 {'graph_adj': {0: [], 1: [], 2: [3], 3: [2]}, ... ... [1174642, 1174712, 1174760]
12045 {'graph_adj': {0: [1], 1: [0], 2: []}, 'int_to... ... [1174888, 1175019]
12046 {'graph_adj': {0: [], 1: []}, 'int_to_umi_dict... ... [1175021, 1175022]
[12047 rows x 12 columns]
2. Understanding results¶
After running, the generated files are shown in the following screenshot.

In the output directory, there are three files:
Annotation
{method}_ave_ed_pos_bin.txt
- in which UMI count summary results will be stored, indicating observed UMIs under multiple genomic positions/loci.{method}_dedup_sum.txt
- deduplicated UMI counts across all reads in the newly generated bam file.{method}_dedup.bam
- in which the deduplicated reads will be stored.
Statistics provided in file mcl_ave_ed_pos_bin.txt
are tabulated as follow:
ave_eds | count |
---|---|
-1.0 | 10470 |
2.0 | 162 |
3.0 | 206 |
... | ... |
Annotation
the 1st column
- average edit distance between UMIs at genomic positions.the 2nd column
- number of reads tagged with UMIs observed in that edit distance.*
-1.0
in columnedit distance
represents that only one unique umi seen at a single genomic position; in total there are 10470 genomic positions seen with one unique umi.
Statistics provided in file mcl_dedup_sum.txt
are tabulated as follow:
index | {method}_umi_len | ave_eds | uniq_umi_len | dedup_uniq_diff_pos | dedup_read_diff_pos |
---|---|---|---|---|---|
0 | 4 | 4.0 | 4 | 0 | 0 |
1 | 2 | 5.0 | 2 | 0 | 0 |
2 | 7 | 4.0 | 9 | 2 | 2 |
... | ... | ... | ... | ... | ... |
Annotation
the 1st column
- genomic positions of interest.the 2nd column
- deduplicated UMI counts (corresponding to the number of DNA molecules/transcripts) at genomic positions.the 3rd column
- average edit distances between UMIs at given genomic positions.the 4th column
- unique UMI counts at given genomic positions.the 5th column
- difference in dedup UMI counts and original unique UMI counts.the 6th column
- difference in the number of dedup reads and original reads.