MutHTP

The MutHTP database [1] collects missense, insertion, and deletion mutation sites from five commonly used resources: Humsavar, SwissVar, 1000 Genomes, COSMIC, and ClinVar.

TMKit offers an interface, tmkit.mut, for accessing the MutHTP database.

Example usage

First, download a copy of the MutHTP database. The example dataset contains a folder called mutation, located at ./data/mutation/, which is where we suggest keeping the data that is used and generated.

import tmkit as tmk

tmk.mut.download_muthtp_db(
    version='2020',
    sv_fp='./data/mutation/'
)

After decompressing the downloaded file, you will have MutHTP_2020.txt. The database can then be read with the following code.

import tmkit as tmk

df = tmk.mut.read_muthtp_db(
    muthtp_fpn='./data/mutation/MutHTP_2020.txt'
)
print(df)

Attributes

Attribute	Description
`version`	version of the MutHTP database to download, for example, 2020
`muthtp_fpn`	path to the decompressed MutHTP database file to read
`sv_fp`	path to the folder in which the downloaded MutHTP database is saved
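
Both path arguments can be derived from a single base folder so that the download and read steps point at the same place. The snippet below is a minimal sketch using only the Python standard library. The helper name `muthtp_paths` is hypothetical and not part of TMKit, the file-name pattern `MutHTP_<version>.txt` is inferred from the 2020 example above, and the trailing separator on `sv_fp` simply mirrors the example value; whether TMKit requires it is an assumption.

```python
import os


def muthtp_paths(base_dir, version="2020"):
    """Return (sv_fp, muthtp_fpn) for a base folder.

    Hypothetical helper for illustration; it is not part of TMKit.
    """
    # Joining with an empty string appends a trailing separator,
    # matching the form of the example value './data/mutation/'.
    sv_fp = os.path.join(base_dir, "")
    # File name inferred from 'MutHTP_2020.txt'; an assumption for other versions.
    muthtp_fpn = os.path.join(base_dir, f"MutHTP_{version}.txt")
    return sv_fp, muthtp_fpn


sv_fp, muthtp_fpn = muthtp_paths("./data/mutation")
print(sv_fp)
print(muthtp_fpn)
```

The two values can then be passed as `sv_fp` to `tmk.mut.download_muthtp_db` and as `muthtp_fpn` to `tmk.mut.read_muthtp_db`, as in the calls shown earlier.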

See also

Please see here for a better understanding of the file-naming system.

Output

Finally, you will see the following output, which lists the 22 features in MutHTP, including UniProt IDs (uniprot_id), PDB IDs (pdb_id), and topological information (topology). You can extract any of these features in Python by name, e.g., df['topology'].

======>reading MutHTP...
======>MutHTP features are:
=========>No.1: id
=========>No.2: gene_id
=========>No.3: uniprot_id
=========>No.4: mutation_type
=========>No.5: chromosome_number
=========>No.6: origin_cell
=========>No.7: nucleotide_mutation_site
=========>No.8: protein_mutation_site
=========>No.9: pdb_id
=========>No.10: protein_structure_mutation_site
=========>No.11: 3D_structure
=========>No.12: interface
=========>No.13: transmembrane_domain
=========>No.14: topology
=========>No.15: disease
=========>No.16: disease_class
=========>No.17: uniprot_id_isoform
=========>No.18: neighbouring_residue
=========>No.19: source_database
=========>No.20: conservation_score
=========>No.21: odds_ratio
=========>No.22: type_passing_membrane
            id  gene_id  ... odds_ratio type_passing_membrane
0            1    ESYT2  ...          -            Multi-pass
1            2  SLC5A10  ...          -            Multi-pass
2            3  SLC5A10  ...          -            Multi-pass
3            4  SLC5A10  ...          -            Multi-pass
4            5  SLC5A10  ...          -            Multi-pass
...        ...      ...  ...        ...                   ...
206384  206385   SLC4A4  ...        NaN                   NaN
206385  206386   SLC4A4  ...        NaN                   NaN
206386  206387   SLC4A4  ...        NaN                   NaN
206387  206388   SLC4A4  ...        NaN                   NaN
206388  206389   SLC4A4  ...        NaN                   NaN

[206389 rows x 22 columns]
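
The printed result has the form of a pandas DataFrame, so the usual pandas operations should apply to it. The following is a minimal sketch on a small stand-in table that reuses a few of the column names listed above. The rows are made-up placeholders rather than real MutHTP records, built inline so the snippet runs without the database file. It extracts a column, filters rows by gene, counts records per gene, and converts the `-` placeholder seen in `odds_ratio` into a missing value.

```python
import pandas as pd

# Stand-in for the DataFrame returned by tmk.mut.read_muthtp_db.
# Column names come from the feature list above; the values are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gene_id": ["ESYT2", "SLC5A10", "SLC5A10", "SLC4A4"],
    "mutation_type": ["missense", "missense", "deletion", "missense"],
    "odds_ratio": ["-", "-", "1.5", None],
    "type_passing_membrane": ["Multi-pass", "Multi-pass", "Multi-pass", None],
})

# Extract a single feature as a pandas Series.
genes = df["gene_id"]

# Keep only the rows that belong to one gene.
slc5a10 = df[df["gene_id"] == "SLC5A10"]

# Count the number of records per gene.
counts = df["gene_id"].value_counts()

# Parse odds_ratio as numbers; entries that cannot be parsed,
# such as the '-' placeholder, become NaN.
odds = pd.to_numeric(df["odds_ratio"], errors="coerce")

print(slc5a10[["id", "gene_id", "mutation_type"]])
print(counts)
print(odds)
```

Converting placeholders to NaN first keeps later numeric steps, such as averaging or thresholding, from failing on string values.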