1.3 Bulk RNA seq
tresor.gene.library is the module that can simulate sequencing libraries at the bulk RNA-seq level. This is used to simulate reads that are tagged by genes, respectively.
Usage¶
ts.gene.library(
    r_root='D:/Programming/R/R-4.3.2/',
    num_genes=6,
    num_samples=2,
    simulator='spsimseq',
    seq_num=50,
    len_params={
        'umi': {
            'umi_unit_pattern': 3,
            'umi_unit_len': 12,
        },
        'seq': 100,
    },
    seq_params={
        'custom': 'BAGC',
        'custom_1': 'V',
    },
    material_params={
        'fasta_cdna_fpn': to('data/Homo_sapiens.GRCh38.cdna.all.fa.gz'),  # None False
    },
    is_seed=True,
    working_dir=to('data/simu/docs/'),
    condis=['umi', 'custom', 'seq', 'custom_1'],
    sim_thres=3,
    permutation=0,
    mode='short_read',  # long_read short_read
    verbose=False,
)
| 1 2 3 4 5 6 7 8 9 10 11 12 13 |  | 
Code highlighted is executed as equivalent to do coding in the following way. We employed the SPsimSeq tool1 to simulate bulk RNA-seq data. In this case, we simulated a sample-by-gene count matrix containing 6 genes and 2 samples.
| 1 2 3 4 5 6 7 8 9 |  | 
The count matrix is formed below.
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |  | 
Then, we use this count matrix as input to Tresor's corresponding module ts.gene.library.
Attributes¶
Illustration
| Attribute | Description | 
|---|---|
| seq_num | number of RNA molecules. 50by default | 
| len_params | lengths of different components of a read | 
| seq_params | sequences of different components of a read, It allows users to add their customised sequences | 
| material_params | a Python dictionary. Showing if cDNA libraries are provided, please use key word fasta_cdna_fpn. The human cDNA library can be downloaded through the Ensembl genome database | 
| num_genes | number of genes | 
| num_samples | number of samples | 
| r_root | path to the R executable | 
| simulator | computational tool to generate a sample-by-gene count matrix | 
| is_seed | if seeds are used to simulate sequencing libraries. This is designed to make in silico experiments reproducible | 
| working_dir | working directory where all simulation results are about to be saved | 
| condis | names of components that a read contains. It can contains an unlimited number of read components | 
| sim_thres | similarity threshold. 3by default | 
| permutation | permutation times | 
| mode | long_readorshort_read | 
| verbose | whether to print intermediate results | 
| Attribute | Description | 
|---|---|
| cfpn | location to the yaml configuration file. Users can specify the atrributes illustrated on the Python tab in the .ymlfile. | 
| snum | number of sequencing molecules | 
| rfpn | path to the R executable | 
| nspl | number of samples | 
| ngene | number of genes | 
| gsimulator | computational tool to generate a sample-by-gene count matrix | 
| permut | permutation times | 
| sthres | similarity threshold. 3by default | 
| wd | working directory where all simulation results are about to be saved | 
| md | long_readorshort_readmode | 
| is | if seeds are used to simulate sequencing libraries. This is designed for reproducible in silico experiments | 
| vb | whether to print intermediate results | 
```
Output¶
Console¶
{'seq_num': 50, 'seq_params': {'custom': 'BAGC', 'custom_1': 'V'}, 'material_params': {'fasta_cdna_fpn': 'D:\\Document\\Programming\\Python\\tresor\\tresor\\data/Homo_sapiens.GRCh38.cdna.all.fa.gz'}, 'mode': 'short_read'}
Finished
Understanding files¶
The resultant files of the simulated sequencing library are shown in the following picture.
 
Annotation
For speeding up computation, we simulated the library for sample 0, which has genes 3, 4, and 5.
- seq_s_0_g_3.txt - sequences for sample 0 and gene 3
- umi_s_0_g_3.txt - UMIs for sample 0 and gene 3
- cdna_ids_alone_s_0_g_3.txt - cDNA IDs for sample 0 and gene 3
In this case, we used homotrimer blocks to simulate UMIs where the length of each UMI is set to be 36 containing 12 trimer blocks.
 
The sequences are randomly chosen from the input human cDNAs and truncated according to the length of each short read.
 
The sequencing library is tabulated to a dataframe. Each row shows the necessary information about the read 1
- Sequence
- Identifier
- Source
Attention
The identifier is composed of ID of the molecule, character s for sample, ID of the sample, character g for gene, and ID of the gene, demarked by *.
Init means a read 1 is a sequence from the sequencing library, to differ from those from PCR amplification.
 
For sample 0, the chosen gene symbols of gene 3 are recorded, which correspond to the taken truncated sequences by order.
 
- 
Alemu Takele Assefa, Jo Vandesompele, Olivier Thas, SPsimSeq: semi-parametric simulation of bulk and single-cell RNA-sequencing data, Bioinformatics, Volume 36, Issue 10, May 2020, Pages 3276–3278, https://doi.org/10.1093/bioinformatics/btaa105 ↩