4. Gene with scRNA seq
mclumi.sc
is a module used for deduplicating UMIs aligned to genes at the single-cell RNA-seq (scRNA-seq) levels.
Usage¶
The mclumi.sc
module can be used in Python and Shell.
1. Python¶
1.1 Command¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
1.2 Attributes in the YAML file¶
Illustration
Attribute | Description |
---|---|
ed_thres |
edit distance |
gene_assigned_tag |
to enable deduplication on the gene tag (XT recommended). Any other tag if you know what tags the reads in your provided BAM file. |
gene_is_assigned_tag |
to check if reads are assigned the gene tag (XS recommended) |
bam_fpn |
input file in the BAM format |
heterogeneity |
mode to indicate if studying trajectories of UMI origins and copies/derivatives across a series of PCR amplication cycles. If heterogeneity is set to True , users will gain more statistics and files in the final output directory |
inflat_val |
inflation parameter of the Markov Clustering algorithm |
exp_val |
expansion parameter of the Markov Clustering algorithm |
iter_num |
number of interations for the Markov Clustering algorithm to converge |
mcl_fold_thres |
a fold change threshold for MCL at a range of (1, l ) where l is the length of a UMI. This parameter is placed when the mcl_ed or mcl_val function is used. |
verbose |
whether to print intermediate results |
2. Shell¶
2.1 Command¶
The shell command is simplified, but more parameters should be specified in a YAML file, params.yml
in order to reduce complexity but optimise future extension.
1 2 3 4 5 6 7 |
|
2.2 Attributes in the command¶
Illustration
Attribute | Description |
---|---|
-m |
UMI deduplication method |
-ed |
edit distance |
-pfpn |
YAML parameter file. Please see the YAML tab for parameter details |
-bfpn |
input file in the BAM format |
-wd |
working directory |
-vb |
whether to print intermediate results |
2.3 YAML file configuration¶
work_dir: /mnt/d/Document/Programming/Python/mclumi/mclumi/data/
# bam_fpn: /mnt/d/Document/Programming/Python/mclumi/mclumi/data/example_bundle.bam
tag:
# bulk
gene_assigned_tag: XT
gene_is_assigned_tag: XS
umi:
ed_thres: 1
dedup:
# mcl_val
mcl_fold_thres: 1.
inflat_val: 1.1
exp_val: 2
iter_num: 100
# mcl_ed
mcl_fold_thres: 1.6
inflat_val: 1.1
exp_val: 2
iter_num: 100
2.4 Attributes in the YAML file¶
Illustration
Attribute | Description |
---|---|
work_dir |
working directory |
ed_thres |
edit distance |
gene_assigned_tag |
to enable deduplication on the gene tag (XT recommended). Any other tag if you know what tags the reads in your provided BAM file. |
gene_is_assigned_tag |
to check if reads are assigned the gene tag (XS recommended) |
inflat_val |
inflation parameter of the Markov Clustering algorithm |
exp_val |
expansion parameter of the Markov Clustering algorithm |
iter_num |
number of interations for the Markov Clustering algorithm to converge |
mcl_fold_thres |
a fold change threshold for MCL at a range of (1, l ) where l is the length of a UMI. This parameter is placed when the mcl_ed or mcl_val function is used. |
Output¶
1. Console¶
____ ____ _______
____ ___ _____/ / / / / |/ / _/
/ __ `__ \/ ___/ / / / / /|_/ // /
/ / / / / / /__/ / /_/ / / / // /
/_/ /_/ /_/\___/_/\____/_/ /_/___/
26/07/2024 18:09:33 logger: ===>You are using method MCL to deduplicate UMIs observed at scRNA-seq levels.
26/07/2024 18:09:33 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/mclumi/mclumi/data/hgmm_100_STAR_FC_sorted.bam
26/07/2024 18:09:33 logger: ===>reading BAM time: 0.00s
26/07/2024 18:09:33 logger: =========>start converting bam to df...
26/07/2024 18:10:10 logger: =========>time to df: 37.209s
26/07/2024 18:10:10 logger: ======># of raw reads: 3553230
26/07/2024 18:10:11 logger: ======># of reads with qualified chrs: 3553230
26/07/2024 18:10:13 logger: ======># of unique barcodes: 100
26/07/2024 18:10:13 logger: ======># of unique umis: 367552
26/07/2024 18:10:13 logger: ======># of redundant umis: 588963
26/07/2024 18:10:13 logger: ======>edit distance thres: 1
26/07/2024 18:10:14 logger: ======># of gene-by-cell positions in the bam: 102842
26/07/2024 18:10:14 logger: ======># of columns in the bam df: 16
26/07/2024 18:10:14 logger: ======>Columns in the bam df: ['id', 'query_name', 'flag', 'reference_id', 'genome_pos', 'mapping_quality', 'cigar', 'query_sequence', 'next_reference_id', 'next_reference_start', 'query_qualities', 'read', 'XT', 'XS', 'bc', 'umi']
26/07/2024 18:10:14 logger: ======># of raw reads:
id ... umi
0 0 ... CCGGAGAGGG
1 1 ... CGCCCCCGGG
2 2 ... CCCGAGAATT
3 3 ... AAATAGCTAG
4 4 ... TCTGCTTTAG
... ... ... ...
3552647 3552647 ... AATCCTCTGG
3552648 3552648 ... GGCCGGTGGG
3552649 3552649 ... CAAACGATAA
3552650 3552650 ... CAACACAAGT
3552651 3552651 ... TGATTGAAGC
[588963 rows x 16 columns]
26/07/2024 18:10:14 logger: ===>start building umi graphs...
26/07/2024 18:13:13 logger: ===>time for building umi graphs: 178.78s
(AAAGATGAGAAACGAG, exon-NM_000099.4-3) {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5]}
(AAAGATGAGAAACGAG, exon-NM_000100.4-3) {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5...
(AAAGATGAGAAACGAG, exon-NM_000101.4-6) {0: [0], 1: [1], 2: [2]}
(AAAGATGAGAAACGAG, exon-NM_000146.4-3) {0: [0], 1: [1]}
(AAAGATGAGAAACGAG, exon-NM_000146.4-4) {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5...
...
(TTTGCGCCAAGTCTGT, id-TCEA1P2) {0: [0]}
(TTTGCGCCAAGTCTGT, id-TXNP6) {0: [0], 1: [1]}
(TTTGCGCCAAGTCTGT, id-VDAC1P2) {0: [0]}
(TTTGCGCCAAGTCTGT, id-YBX1P1) {0: [0]}
(TTTGCGCCAAGTCTGT, id-YBX1P10) {0: [0]}
Name: mcl, Length: 102842, dtype: object
===>Analysis has been complete and results have been saved!
vignette ... mcl_bam_ids
(AAAGATGAGAAACGAG, exon-NM_000099.4-3) {'graph_adj': {0: [], 1: [], 2: [], 3: [], 4: ... ... [3099117, 3099235, 3099257, 3099287, 3099354, ...
(AAAGATGAGAAACGAG, exon-NM_000100.4-3) {'graph_adj': {0: [], 1: [], 2: [], 3: [], 4: ... ... [3171457, 3171495, 3171504, 3171564, 3171693, ...
(AAAGATGAGAAACGAG, exon-NM_000101.4-6) {'graph_adj': {0: [], 1: [], 2: []}, 'int_to_u... ... [2455005, 2455221, 2455320]
(AAAGATGAGAAACGAG, exon-NM_000146.4-3) {'graph_adj': {0: [], 1: []}, 'int_to_umi_dict... ... [3008063, 3008100]
(AAAGATGAGAAACGAG, exon-NM_000146.4-4) {'graph_adj': {0: [], 1: [], 2: [], 3: [], 4: ... ... [3009218, 3009245, 3009263, 3009571, 3009681, ...
... ... ... ...
(TTTGCGCCAAGTCTGT, id-TCEA1P2) {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [602919]
(TTTGCGCCAAGTCTGT, id-TXNP6) {'graph_adj': {0: [], 1: []}, 'int_to_umi_dict... ... [799963, 799966]
(TTTGCGCCAAGTCTGT, id-VDAC1P2) {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [3284956]
(TTTGCGCCAAGTCTGT, id-YBX1P1) {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [2164172]
(TTTGCGCCAAGTCTGT, id-YBX1P10) {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ... ... [1503503]
[102842 rows x 12 columns]
2. Understanding results¶
After running, the generated files are shown in the following screenshot.

In the output directory, there are three files:
Annotation
{method}_ave_ed_pos_bin.txt
- in which UMI count summary results will be stored, indicating observed UMIs assigned to multiple genes.{method}_dedup_sum.txt
- deduplicated UMI counts across all reads in the newly generated bam file.{method}_dedup.bam
- in which the deduplicated reads will be stored.
Statistics provided in file mcl_ave_ed_pos_bin.txt
are tabulated as follow:
ave_eds | count |
---|---|
-1.0 | 52980 |
2.0 | 7 |
3.0 | 36 |
... | ... |
Annotation
the 1st column
- average edit distance between UMIs.the 2nd column
- number of UMI-tagged reads observed in that edit distance.*
-1.0
in columnedit distance
represents that only one unique umi seen for all cell-by-gene types; in total there are 52980 reads seen with one unique umi.
Statistics provided in file mcl_dedup_sum.txt
are tabulated as follow:
index | {method}_umi_len | ave_eds | uniq_umi_len | dedup_uniq_diff_pos | dedup_read_diff_pos |
---|---|---|---|---|---|
('AAAGATGAGAAACGAG', 'exon-NM_000099.4-3') | 6 | 8.0 | 6 | 0 | 0 |
('AAAGATGAGAAACGAG', 'exon-NM_000100.4-3') | 14 | 8.0 | 14 | 0 | 0 |
('AAAGATGAGAAACGAG', 'exon-NM_000101.4-6') | 3 | 8.0 | 3 | 0 | 0 |
... | ... | ... | ... | ... | ... |
Annotation
the 1st column
- cell-by-gene types.the 2nd column
- deduplicated UMI counts (corresponding to the number of DNA molecules/transcripts) for a given cell-by-gene type.the 3rd column
- average edit distances between UMIs for given cell-by-gene types.the 4th column
- unique UMI counts for given cell-by-gene types.the 5th column
- difference in dedup UMI counts and original unique UMI counts.the 6th column
- difference in the number of dedup reads and original reads.