Residue contact

Residue contact#

Protein residue contacts are crucial for protein structural prediction and drug target interaction prediction. TMKit allows users to parse protein residue contacts predicted by PSICOV[1], FreeContact[2], CCMPred[3], Gremlin[4], GDCA[5], PlmDCA[6], MemConP[7], Membrain2[8], and DeepHelicon[9].

PSICOV, FreeContact, CCMPred, Gremlin, GDCA, and PlmDCA are canonical covariance methods, while MemConP, Membrain2, and DeepHelicon are machine learning methods specialized for transmembrane proteins.

Reminder of data

Please make sure that the build-in example dataset has been downloaded before you walk through the tutorial.

Example usage#

We can use the following code to obtain the residue contacts of protein 1xqf chain A. Similar to other cases in our tutorial, there are commonly used parameters. Please the next section for details.

import tmkit as tmk

df1, df2 = tmk.rrc.read(
    prot_name='1xqf',
    seq_chain='A',
    fasta_fp='data/fasta/',
    pdb_fp='data/pdb/',
    xml_fp='data/xml/',
    dist_fp='data/rrc/',
    tool_fp='data/rrc/tool/',
    seq_sep_inferior=1,
    seq_sep_superior=None,
    tool='membrain2',
)

Attributes#

Attribute

Description

pdb_fp

path where a target PDB file is placed

fasta_fp

path where a target Fasta file is placed

xml_fp

path where a target XML file is placed

dist_fp

path where a file containing real distances between residues is placed (please check the file at ./data/rrc in the example dataset)

tool_fp

path where a protein residue contact map file is placed

tool

name of a contact prediction tool. It can be one of PSICOV, FreeContact, CCMPred, Gremlin, GDCA, PlmDCA, MemConP, Membrain2, and DeepHelicon

seq_sep_inferior

The lower bounds of how far any two residues are in pairs

seq_sep_superior

The upper bounds of how far any two residues are in pairs

prot_name

name of a protein in the prefix of a PDB file name (e.g., 1xqf in 1xqfA.pdb)

seq_chain

chain of a protein in the prefix of a PDB file name (e.g., A in 1xqfA.pdb). Parameter file_chain will be converted within the function

See also

Please see here for better understanding the file-naming system.

Output#

There are two Pandas dataframes. The first one df1 is the dataframe containing the predicted contacts by tool Membrain2.

print(df1)
contact_id_1  contact_id_2     score
0                13            44  0.032846
1                13            45  0.011669
2                13            46  0.019312
3                13            47  0.089862
4                13            48  0.026575
...             ...           ...       ...
19443           308           349  0.044726
19444           308           350  0.080527
19445           308           351  0.039438
19446           308           352  0.034000
19447           308           353  0.074005
[19448 rows x 3 columns]

The second one df2 is the dataframe containing the real distances between two residues, such that.

Attribute

Description

fasta_id_1

Fasta id of the first residue

aa_1

Amino acid type of the first residue

pdb_id_1

PDB id of the first residue

fasta_id_2

Fasta id of the second residue

aa_2

Amino acid type of the second residue

pdb_id_2

PDB id of the second residue

dist

distance

is_contact

if they are in contact

print(df2)
fasta_id_1 aa_1 pdb_id_1 fasta_id_2 aa_2 pdb_id_2       dist is_contact
0             13    I       15         44    T       46  23.495386          0
1             13    I       15         45    Q       47  22.651615          0
2             13    I       15         46    V       48   18.67347          0
3             13    I       15         47    T       49  19.484049          0
4             13    I       15         48    V       50   21.53894          0
...          ...  ...      ...        ...  ...      ...        ...        ...
19443        308    F      332        349    G      373  35.690994          0
19444        308    F      332        350    Y      374  32.043457          0
19445        308    F      332        351    K      375  38.532841          0
19446        308    F      332        352    L      376  40.355228          0
19447        308    F      332        353    A      377  40.803558          0
[19448 rows x 8 columns]

You can combine the two dataframes directly because they have been aligned this way below, which makes your research easier.

import pandas as pd
df = pd.concat([df1, df2], axis=1)
print(df)

It outputs:

       contact_id_1  contact_id_2     score  ... pdb_id_2       dist is_contact
0                13            44  0.032846  ...       46  23.495386          0
1                13            45  0.011669  ...       47  22.651615          0
2                13            46  0.019312  ...       48   18.67347          0
3                13            47  0.089862  ...       49  19.484049          0
4                13            48  0.026575  ...       50   21.53894          0
...             ...           ...       ...  ...      ...        ...        ...
19443           308           349  0.044726  ...      373  35.690994          0
19444           308           350  0.080527  ...      374  32.043457          0
19445           308           351  0.039438  ...      375  38.532841          0
19446           308           352  0.034000  ...      376  40.355228          0
19447           308           353  0.074005  ...      377  40.803558          0

[19448 rows x 11 columns]