Pred-MutHTP#

The Pred-MutHTP database[1] contains predicted disease‐causing and neutral mutations in human transmembrane proteins. In the absence of experimentally-verified records, these are extremely useful for providing information about the status of mutations that occur in amino acid sites of human transmembrane proteins.

TMKit offers an interface (tmkit.mut) to access the database.

Reminder of data

Please make sure that the build-in example dataset has been downloaded before you walk through the tutorial.

Download and read#

First, let’s download the database of Pred-MutHTP. In the example dataset, there is a folder called mutation. The path is ./data/mutation/, which is the place where we suggest users to manage the data used and generated.

The database is called pred_varhtp_mut.zip. You should decompress it after downloading.

import tmkit as tmk

tmk.mut.download_predmuthtp_db(
    sv_fp='./data/mutation/'
)

After decompressing it, you will have pred_varhtp_mut.csv. We can now access this database using the following code.

import tmkit as tmk

df = tmk.mut.read_predmuthtp_db(
    pred_muthtp_fpn='./data/mutation/pred_varhtp_mut.csv'
)
print(df)

Finally, you will see the following output, which shows 5 features in Pred-MutHTP, including:

  • uniprot_id

  • protein_mutation_site

  • topology

  • mutation_type

  • mut_prob

You can extract each of the feature in Python, e.g., df['topology'].

Attributes#

Attribute

Description

pred_muthtp_fpn

path where the Pred-MutHTP database is placed

sv_fp

path to where you want to save the Pred-MutHTP database to be downloaded

See also

Please see here for better understanding the file-naming system.

Output#

======>reading Pred-MutHTP...
======>Pred-MutHTP features are:
=========>No.1: uniprot_id
=========>No.2: protein_mutation_site
=========>No.3: topology
=========>No.4: mutation_type
=========>No.5: mut_prob
         uniprot_id protein_mutation_site  ...    mutation_type mut_prob
0            A0PJX4                   M1A  ...          Neutral    0.731
1            A0PJX4                   M1C  ...          Neutral    0.731
2            A0PJX4                   M1D  ...  Disease-causing    0.745
3            A0PJX4                   M1E  ...  Disease-causing    0.745
4            A0PJX4                   M1F  ...  Disease-causing    0.745
...             ...                   ...  ...              ...      ...
54962537     Q9Y277                 A283S  ...          Neutral    0.820
54962538     Q9Y277                 A283T  ...          Neutral    0.525
54962539     Q9Y277                 A283V  ...  Disease-causing    0.542
54962540     Q9Y277                 A283W  ...  Disease-causing    0.690
54962541     Q9Y277                 A283Y  ...  Disease-causing    0.658

[54962542 rows x 5 columns]

Split into individual files#

This will split the Pred-MutHTP database into individual files according to the UniProt accession codes of human transmembrane proteins.

import tmkit as tmk

df = tmk.mut.read_predmuthtp_db(
    pred_muthtp_fpn='./data/mutation/pred_varhtp_mut.csv'
)

tmk.mut.split_predmuthtp(
    pred_muthtp_df=df,
    sv_fp='data/mutation/'
)

Attributes#

Attribute

Description

pred_muthtp_df

Pandas dataframe of the Pred-MutHTP database

pred_muthtp_fpn

path where the Pred-MutHTP database is placed

sv_fp

path to where you want to save the Pred-MutHTP database to be downloaded

See also

Please see here for better understanding the file-naming system.

Output#

In the console, it prints the output as shown below.

======>5185 uniprot proteins
=========>Splitting No.0 protein from Pred-MutHTP
=========>Splitting No.1 protein from Pred-MutHTP
=========>Splitting No.2 protein from Pred-MutHTP
=========>Splitting No.3 protein from Pred-MutHTP
......

Access an individual file#

Now, we can see what is contained in an individual file. Let’s check this protein A0PK00 (UniProt accession code) using the following code.

import tmkit as tmk

df = tmk.mut.read_split_predmuthtp(
    pred_split_muthtp_fpn='data/mutation/A0PK00.predmuthtp'
)

Attributes#

Attribute

Description

pred_split_muthtp_fpn

path where an individual file from the Pred-MutHTP database is placed

See also

Please see here for better understanding the file-naming system.

Output#

In the console, it prints the output as shown below.

======>reading split Pred-MutHTP...
     uniprot_id protein_mutation_site  mut_prob
0        W5XKT8                  Y34A     0.698
1        W5XKT8                  Y34C     0.567
2        W5XKT8                  Y34D     0.551
3        W5XKT8                  Y34E     0.642
4        W5XKT8                  Y34F     0.785
...         ...                   ...       ...
6151     W5XKT8                 N324S     0.598
6152     W5XKT8                 N324T     0.711
6153     W5XKT8                 N324V     0.859
6154     W5XKT8                 N324W     0.874
6155     W5XKT8                 N324Y     0.809

[6156 rows x 3 columns]