Residue contact


Protein residue contacts are crucial for protein structure prediction and drug-target interaction prediction. TMKit allows users to parse protein residue contacts predicted by PSICOV[1], FreeContact[2], CCMPred[3], Gremlin[4], GDCA[5], PlmDCA[6], MemConP[7], Membrain2[8], and DeepHelicon[9].

PSICOV, FreeContact, CCMPred, Gremlin, GDCA, and PlmDCA are canonical covariance methods, while MemConP, Membrain2, and DeepHelicon are machine learning methods specialized for transmembrane proteins.

Example usage

The following code retrieves the residue contacts of chain A of protein 1xqf. As in the other cases in this tutorial, the function takes a set of commonly used parameters. Please see the next section for details.

import tmkit as tmk

df1, df2 = tmk.rrc.read(
    prot_name='1xqf',
    seq_chain='A',
    fasta_fp='data/fasta/',
    pdb_fp='data/pdb/',
    xml_fp='data/xml/',
    dist_fp='data/rrc/',
    tool_fp='data/rrc/tool/',
    seq_sep_inferior=1,
    seq_sep_superior=None,
    tool='membrain2',
)

Attributes

Attribute	Description
`pdb_fp`	path where a target PDB file is placed
`fasta_fp`	path where a target Fasta file is placed
`xml_fp`	path where a target XML file is placed
`dist_fp`	path where a file containing real distances between residues is placed (please check the file at ./data/rrc in the example dataset)
`tool_fp`	path where a protein residue contact map file is placed
`tool`	name of a contact prediction tool. It can be one of PSICOV, FreeContact, CCMPred, Gremlin, GDCA, PlmDCA, MemConP, Membrain2, and DeepHelicon
`seq_sep_inferior`	lower bound on the sequence separation between the two residues of a pair
`seq_sep_superior`	upper bound on the sequence separation between the two residues of a pair (None in the example above)
`prot_name`	name of a protein in the prefix of a PDB file name (e.g., 1xqf in 1xqfA.pdb)
`seq_chain`	chain of a protein in the prefix of a PDB file name (e.g., A in 1xqfA.pdb). Parameter file_chain will be converted within the function
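
To make the two sequence-separation bounds concrete, here is a minimal pandas sketch of the kind of filter they describe. The helper `filter_by_seq_sep` is hypothetical and not part of TMKit, the toy pairs are made up, and treating both bounds as inclusive is an assumption; TMKit's own handling may differ.

```python
import pandas as pd


def filter_by_seq_sep(df, inferior=1, superior=None,
                      col_1="contact_id_1", col_2="contact_id_2"):
    """Keep residue pairs whose sequence separation |i - j| lies within the bounds.

    Hypothetical helper for illustration only (not a TMKit function).
    Both bounds are treated as inclusive, which is an assumption;
    superior=None means no upper bound.
    """
    sep = (df[col_2] - df[col_1]).abs()
    mask = sep >= inferior
    if superior is not None:
        mask &= (sep <= superior)
    return df[mask].reset_index(drop=True)


# Toy residue pairs (made up for illustration)
pairs = pd.DataFrame({
    "contact_id_1": [13, 13, 13, 13],
    "contact_id_2": [14, 19, 44, 120],
})

# Keep pairs separated by at least 6 and at most 50 positions
filtered = filter_by_seq_sep(pairs, inferior=6, superior=50)
print(filtered)
```

With `superior=None`, as in the example call above, only the lower bound is applied in this sketch.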

Output

The function returns two pandas dataframes. The first, df1, contains the contacts predicted by the tool Membrain2.

print(df1)
contact_id_1  contact_id_2     score
          13            44  0.032846
          13            45  0.011669
          13            46  0.019312
          13            47  0.089862
          13            48  0.026575
         ...           ...       ...
         308           349  0.044726
         308           350  0.080527
         308           351  0.039438
         308           352  0.034000
         308           353  0.074005
[19448 rows x 3 columns]
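
A common next step with a table like df1 is to keep only the highest-scoring pairs. The sketch below does this with pandas, using the first five rows of the output above as a small stand-in for the full dataframe. It assumes that a higher score means a more confident prediction, and the cutoff k = 3 is arbitrary.

```python
import pandas as pd

# First five rows of the df1 output shown above, used as a small stand-in
df1_head = pd.DataFrame({
    "contact_id_1": [13, 13, 13, 13, 13],
    "contact_id_2": [44, 45, 46, 47, 48],
    "score": [0.032846, 0.011669, 0.019312, 0.089862, 0.026575],
})

# Keep the k highest-scoring residue pairs (k chosen arbitrarily here)
k = 3
top_k = df1_head.nlargest(k, "score").reset_index(drop=True)
print(top_k)
```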

The second, df2, contains the real distances between the two residues of each pair. Its columns are described below.

Column	Description
fasta_id_1	Fasta id of the first residue
aa_1	amino acid type of the first residue
pdb_id_1	PDB id of the first residue
fasta_id_2	Fasta id of the second residue
aa_2	amino acid type of the second residue
pdb_id_2	PDB id of the second residue
dist	real distance between the two residues
is_contact	whether the two residues are in contact

print(df2)
fasta_id_1 aa_1 pdb_id_1 fasta_id_2 aa_2 pdb_id_2       dist is_contact
        13    I       15         44    T       46  23.495386          0
        13    I       15         45    Q       47  22.651615          0
        13    I       15         46    V       48   18.67347          0
        13    I       15         47    T       49  19.484049          0
        13    I       15         48    V       50   21.53894          0
       ...  ...      ...        ...  ...      ...        ...        ...
       308    F      332        349    G      373  35.690994          0
       308    F      332        350    Y      374  32.043457          0
       308    F      332        351    K      375  38.532841          0
       308    F      332        352    L      376  40.355228          0
       308    F      332        353    A      377  40.803558          0
[19448 rows x 8 columns]
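
In the rows shown, `contact_id_1` and `contact_id_2` in df1 match `fasta_id_1` and `fasta_id_2` in df2. If you want to confirm that correspondence on your own data, a quick sanity check could look like the sketch below. It uses the first three rows of each output as stand-ins and assumes the id columns hold integers; if they are stored as strings in your dataframes, cast them first (for example with `astype(int)`).

```python
import pandas as pd

# Small stand-ins built from the first rows of the outputs shown above
df1_head = pd.DataFrame({
    "contact_id_1": [13, 13, 13],
    "contact_id_2": [44, 45, 46],
    "score": [0.032846, 0.011669, 0.019312],
})
df2_head = pd.DataFrame({
    "fasta_id_1": [13, 13, 13],
    "aa_1": ["I", "I", "I"],
    "pdb_id_1": [15, 15, 15],
    "fasta_id_2": [44, 45, 46],
    "aa_2": ["T", "Q", "V"],
    "pdb_id_2": [46, 47, 48],
    "dist": [23.495386, 22.651615, 18.67347],
    "is_contact": [0, 0, 0],
})

# Row i of both frames should describe the same residue pair
same_first = (df1_head["contact_id_1"].to_numpy()
              == df2_head["fasta_id_1"].to_numpy()).all()
same_second = (df1_head["contact_id_2"].to_numpy()
               == df2_head["fasta_id_2"].to_numpy()).all()
rows_aligned = bool(same_first and same_second)
print(rows_aligned)
```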

Because the rows of the two dataframes are already aligned, you can combine them directly by concatenating them column-wise, as shown below.

import pandas as pd
df = pd.concat([df1, df2], axis=1)
print(df)

It outputs:

contact_id_1  contact_id_2     score  ... pdb_id_2       dist is_contact
          13            44  0.032846  ...       46  23.495386          0
          13            45  0.011669  ...       47  22.651615          0
          13            46  0.019312  ...       48   18.67347          0
          13            47  0.089862  ...       49  19.484049          0
          13            48  0.026575  ...       50   21.53894          0
         ...           ...       ...  ...      ...        ...        ...
         308           349  0.044726  ...      373  35.690994          0
         308           350  0.080527  ...      374  32.043457          0
         308           351  0.039438  ...      375  38.532841          0
         308           352  0.034000  ...      376  40.355228          0
         308           353  0.074005  ...      377  40.803558          0

[19448 rows x 11 columns]
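
With predictions and real contacts side by side, one natural use of the combined dataframe is to check how many of the top-scoring predictions are real contacts. The sketch below is an illustration under stated assumptions, not a TMKit feature: `top_k_precision` is a hypothetical helper, the toy scores and labels are made up, `is_contact` is assumed to be 1 for a contact and 0 otherwise, and a higher score is assumed to mean a more confident prediction.

```python
import pandas as pd

# Toy combined table; scores and labels are made up for illustration
df = pd.DataFrame({
    "contact_id_1": [10, 10, 10, 10, 10, 10],
    "contact_id_2": [20, 21, 22, 23, 24, 25],
    "score": [0.90, 0.80, 0.35, 0.60, 0.10, 0.05],
    "is_contact": [1, 0, 0, 1, 0, 0],
})


def top_k_precision(table, k):
    """Fraction of the k highest-scoring pairs that are real contacts.

    Hypothetical helper, not part of TMKit.
    """
    top = table.nlargest(k, "score")
    # Cast in case is_contact is stored as strings rather than integers
    return top["is_contact"].astype(int).mean()


print(top_k_precision(df, 2))
print(top_k_precision(df, 3))
```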