Skip to content

4. Gene with scRNA seq

mclumi.sc is a module used for deduplicating UMIs aligned to genes at the single-cell RNA-seq (scRNA-seq) levels.

Usage

The mclumi.sc module can be used in Python and Shell.

1. Python

1.1 Command

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import mclumi as mu

df_mcl = mu.sc.mcl(
    bam_fpn=to('data/hgmm_100_STAR_FC_sorted.bam'),
    ed_thres=1,
    gene_assigned_tag='XT',
    gene_is_assigned_tag='XS',
    work_dir=to('data/'),
    verbose=False,  # False True

    heterogeneity=False,  # False True

    inflat_val=1.6,
    exp_val=2,
    iter_num=100,
)

print(df_mcl)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
import mclumi as mu

df_mcl_val = mu.sc.mcl_val(
    bam_fpn=to('data/hgmm_100_STAR_FC_sorted.bam'),
    ed_thres=1,
    gene_assigned_tag='XT',
    gene_is_assigned_tag='XS',
    work_dir=to('data/'),
    verbose=False,  # False True

    heterogeneity=False,  # False True

    mcl_fold_thres=1.5,
    inflat_val=1.6,
    exp_val=2,
    iter_num=100,
)

print(df_mcl_val)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
import mclumi as mu

df_mcl_ed = mu.sc.mcl_ed(
    bam_fpn=to('data/hgmm_100_STAR_FC_sorted.bam'),
    ed_thres=1,
    gene_assigned_tag='XT',
    gene_is_assigned_tag='XS',
    work_dir=to('data/'),
    verbose=False,  # False True

    heterogeneity=False,  # False True

    mcl_fold_thres=1.5,
    inflat_val=1.6,
    exp_val=2,
    iter_num=100,
)

print(df_mcl_ed)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
import mclumi as mu

df_unique = mu.sc.unique(
    bam_fpn=to('data/hgmm_100_STAR_FC_sorted.bam'),
    ed_thres=1,
    gene_assigned_tag='XT',
    gene_is_assigned_tag='XS',
    work_dir=to('data/'),
    verbose=False,  # False True

    heterogeneity=False,  # False True
)

print(df_unique)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
import mclumi as mu

df_cluster = mu.sc.cluster(
    bam_fpn=to('data/hgmm_100_STAR_FC_sorted.bam'),
    ed_thres=1,
    gene_assigned_tag='XT',
    gene_is_assigned_tag='XS',
    work_dir=to('data/'),
    verbose=False,  # False True

    heterogeneity=False,  # False True
)

print(df_cluster)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
import mclumi as mu

df_adjacency = mu.sc.adjacency(
    bam_fpn=to('data/hgmm_100_STAR_FC_sorted.bam'),
    ed_thres=1,
    gene_assigned_tag='XT',
    gene_is_assigned_tag='XS',
    work_dir=to('data/'),
    verbose=False,  # False True

    heterogeneity=False,  # False True
)

print(df_adjacency)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
import mclumi as mu

df_directional = mu.sc.directional(
    bam_fpn=to('data/hgmm_100_STAR_FC_sorted.bam'),
    ed_thres=1,
    gene_assigned_tag='XT',
    gene_is_assigned_tag='XS',
    work_dir=to('data/'),
    verbose=False,  # False True

    heterogeneity=False,  # False True
)

print(df_directional)

1.2 Attributes in the YAML file

Illustration

Attribute Description
ed_thres edit distance
gene_assigned_tag to enable deduplication on the gene tag (XT recommended). Any other tag if you know what tags the reads in your provided BAM file.
gene_is_assigned_tag to check if reads are assigned the gene tag (XS recommended)
bam_fpn input file in the BAM format
heterogeneity mode to indicate if studying trajectories of UMI origins and copies/derivatives across a series of PCR amplication cycles. If heterogeneity is set to True, users will gain more statistics and files in the final output directory
inflat_val inflation parameter of the Markov Clustering algorithm
exp_val expansion parameter of the Markov Clustering algorithm
iter_num number of interations for the Markov Clustering algorithm to converge
mcl_fold_thres a fold change threshold for MCL at a range of (1, l) where l is the length of a UMI. This parameter is placed when the mcl_ed or mcl_val function is used.
verbose whether to print intermediate results

2. Shell

2.1 Command

The shell command is simplified, but more parameters should be specified in a YAML file, params.yml in order to reduce complexity but optimise future extension.

1
2
3
4
5
6
7
mclumi sc \
-m mcl \
-ed 1  \
-pfpn ./mclumi/data/params.yml \
-bfpn ./mclumi/data/hgmm_100_STAR_FC_sorted.bam \
-wd ./mclumi/data/ \
-vb True

2.2 Attributes in the command

Illustration

Attribute Description
-m UMI deduplication method
-ed edit distance
-pfpn YAML parameter file. Please see the YAML tab for parameter details
-bfpn input file in the BAM format
-wd working directory
-vb whether to print intermediate results

2.3 YAML file configuration

work_dir: /mnt/d/Document/Programming/Python/mclumi/mclumi/data/

# bam_fpn: /mnt/d/Document/Programming/Python/mclumi/mclumi/data/example_bundle.bam

tag:
  # bulk
  gene_assigned_tag: XT
  gene_is_assigned_tag: XS

umi:
  ed_thres: 1

dedup:
  # mcl_val
  mcl_fold_thres: 1.
  inflat_val: 1.1
  exp_val: 2
  iter_num: 100

  # mcl_ed
  mcl_fold_thres: 1.6
  inflat_val: 1.1
  exp_val: 2
  iter_num: 100

2.4 Attributes in the YAML file

Illustration

Attribute Description
work_dir working directory
ed_thres edit distance
gene_assigned_tag to enable deduplication on the gene tag (XT recommended). Any other tag if you know what tags the reads in your provided BAM file.
gene_is_assigned_tag to check if reads are assigned the gene tag (XS recommended)
inflat_val inflation parameter of the Markov Clustering algorithm
exp_val expansion parameter of the Markov Clustering algorithm
iter_num number of interations for the Markov Clustering algorithm to converge
mcl_fold_thres a fold change threshold for MCL at a range of (1, l) where l is the length of a UMI. This parameter is placed when the mcl_ed or mcl_val function is used.

Output

1. Console

                   ____  ____  _______
   ____ ___  _____/ / / / /  |/  /  _/
  / __ `__ \/ ___/ / / / / /|_/ // /  
 / / / / / / /__/ / /_/ / /  / // /   
/_/ /_/ /_/\___/_/\____/_/  /_/___/   


26/07/2024 18:09:33 logger: ===>You are using method MCL to deduplicate UMIs observed at scRNA-seq levels.
26/07/2024 18:09:33 logger: ===>reading the bam file... /mnt/d/Document/Programming/Python/mclumi/mclumi/data/hgmm_100_STAR_FC_sorted.bam
26/07/2024 18:09:33 logger: ===>reading BAM time: 0.00s
26/07/2024 18:09:33 logger: =========>start converting bam to df...
26/07/2024 18:10:10 logger: =========>time to df: 37.209s
26/07/2024 18:10:10 logger: ======># of raw reads: 3553230
26/07/2024 18:10:11 logger: ======># of reads with qualified chrs: 3553230
26/07/2024 18:10:13 logger: ======># of unique barcodes: 100
26/07/2024 18:10:13 logger: ======># of unique umis: 367552
26/07/2024 18:10:13 logger: ======># of redundant umis: 588963
26/07/2024 18:10:13 logger: ======>edit distance thres: 1
26/07/2024 18:10:14 logger: ======># of gene-by-cell positions in the bam: 102842
26/07/2024 18:10:14 logger: ======># of columns in the bam df: 16
26/07/2024 18:10:14 logger: ======>Columns in the bam df: ['id', 'query_name', 'flag', 'reference_id', 'genome_pos', 'mapping_quality', 'cigar', 'query_sequence', 'next_reference_id', 'next_reference_start', 'query_qualities', 'read', 'XT', 'XS', 'bc', 'umi']
26/07/2024 18:10:14 logger: ======># of raw reads:
              id  ...         umi
0              0  ...  CCGGAGAGGG
1              1  ...  CGCCCCCGGG
2              2  ...  CCCGAGAATT
3              3  ...  AAATAGCTAG
4              4  ...  TCTGCTTTAG
...          ...  ...         ...
3552647  3552647  ...  AATCCTCTGG
3552648  3552648  ...  GGCCGGTGGG
3552649  3552649  ...  CAAACGATAA
3552650  3552650  ...  CAACACAAGT
3552651  3552651  ...  TGATTGAAGC

[588963 rows x 16 columns]
26/07/2024 18:10:14 logger: ===>start building umi graphs...
26/07/2024 18:13:13 logger: ===>time for building umi graphs: 178.78s
(AAAGATGAGAAACGAG, exon-NM_000099.4-3)     {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5]}
(AAAGATGAGAAACGAG, exon-NM_000100.4-3)    {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5...
(AAAGATGAGAAACGAG, exon-NM_000101.4-6)                             {0: [0], 1: [1], 2: [2]}
(AAAGATGAGAAACGAG, exon-NM_000146.4-3)                                     {0: [0], 1: [1]}
(AAAGATGAGAAACGAG, exon-NM_000146.4-4)    {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5...
                                                                ...                        
(TTTGCGCCAAGTCTGT, id-TCEA1P2)                                                     {0: [0]}
(TTTGCGCCAAGTCTGT, id-TXNP6)                                               {0: [0], 1: [1]}
(TTTGCGCCAAGTCTGT, id-VDAC1P2)                                                     {0: [0]}
(TTTGCGCCAAGTCTGT, id-YBX1P1)                                                      {0: [0]}
(TTTGCGCCAAGTCTGT, id-YBX1P10)                                                     {0: [0]}
Name: mcl, Length: 102842, dtype: object
===>Analysis has been complete and results have been saved!
                                                                                 vignette  ...                                        mcl_bam_ids
(AAAGATGAGAAACGAG, exon-NM_000099.4-3)  {'graph_adj': {0: [], 1: [], 2: [], 3: [], 4: ...  ...  [3099117, 3099235, 3099257, 3099287, 3099354, ...
(AAAGATGAGAAACGAG, exon-NM_000100.4-3)  {'graph_adj': {0: [], 1: [], 2: [], 3: [], 4: ...  ...  [3171457, 3171495, 3171504, 3171564, 3171693, ...
(AAAGATGAGAAACGAG, exon-NM_000101.4-6)  {'graph_adj': {0: [], 1: [], 2: []}, 'int_to_u...  ...                        [2455005, 2455221, 2455320]
(AAAGATGAGAAACGAG, exon-NM_000146.4-3)  {'graph_adj': {0: [], 1: []}, 'int_to_umi_dict...  ...                                 [3008063, 3008100]
(AAAGATGAGAAACGAG, exon-NM_000146.4-4)  {'graph_adj': {0: [], 1: [], 2: [], 3: [], 4: ...  ...  [3009218, 3009245, 3009263, 3009571, 3009681, ...
...                                                                                   ...  ...                                                ...
(TTTGCGCCAAGTCTGT, id-TCEA1P2)          {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ...  ...                                           [602919]
(TTTGCGCCAAGTCTGT, id-TXNP6)            {'graph_adj': {0: [], 1: []}, 'int_to_umi_dict...  ...                                   [799963, 799966]
(TTTGCGCCAAGTCTGT, id-VDAC1P2)          {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ...  ...                                          [3284956]
(TTTGCGCCAAGTCTGT, id-YBX1P1)           {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ...  ...                                          [2164172]
(TTTGCGCCAAGTCTGT, id-YBX1P10)          {'graph_adj': {0: []}, 'int_to_umi_dict': {0: ...  ...                                          [1503503]

[102842 rows x 12 columns]

2. Understanding results

After running, the generated files are shown in the following screenshot.

Image title
Fig 1. Generated files with MCL

In the output directory, there are three files:

Annotation

  • {method}_ave_ed_pos_bin.txt - in which UMI count summary results will be stored, indicating observed UMIs assigned to multiple genes.
  • {method}_dedup_sum.txt - deduplicated UMI counts across all reads in the newly generated bam file.
  • {method}_dedup.bam - in which the deduplicated reads will be stored.

Statistics provided in file mcl_ave_ed_pos_bin.txt are tabulated as follow:

ave_eds count
-1.0 52980
2.0 7
3.0 36
... ...

Annotation

  • the 1st column - average edit distance between UMIs.
  • the 2nd column - number of UMI-tagged reads observed in that edit distance.
  • * -1.0 in column edit distance represents that only one unique umi seen for all cell-by-gene types; in total there are 52980 reads seen with one unique umi.

Statistics provided in file mcl_dedup_sum.txt are tabulated as follow:

index {method}_umi_len ave_eds uniq_umi_len dedup_uniq_diff_pos dedup_read_diff_pos
('AAAGATGAGAAACGAG', 'exon-NM_000099.4-3') 6 8.0 6 0 0
('AAAGATGAGAAACGAG', 'exon-NM_000100.4-3') 14 8.0 14 0 0
('AAAGATGAGAAACGAG', 'exon-NM_000101.4-6') 3 8.0 3 0 0
... ... ... ... ... ...

Annotation

  • the 1st column - cell-by-gene types.
  • the 2nd column - deduplicated UMI counts (corresponding to the number of DNA molecules/transcripts) for a given cell-by-gene type.
  • the 3rd column - average edit distances between UMIs for given cell-by-gene types.
  • the 4th column - unique UMI counts for given cell-by-gene types.
  • the 5th column - difference in dedup UMI counts and original unique UMI counts.
  • the 6th column - difference in the number of dedup reads and original reads.