Pipeline 1: Interaction Site Prediction
We walk you through a detailed, hands-on tutorial on predicting interaction sites in transmembrane proteins using PyPropel in tandem with TMKit, an external tool specialising in sequence analysis of transmembrane proteins.
You will specifically see how PyPropel and TMKit can batch pre-process 10,971 protein complex structure files (comprising the structures of 58,060 protein chains). Finally, we will feed the generated data into machine learning with scikit-learn.
It includes five modules:
- Pre-processing
- Feature extraction
- Machine learning with random forest
- Machine learning evaluation metrics
- Visualisation of evaluation metrics
Let's first import both of the libraries.
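A minimal sketch of the imports, using the aliases `pp` and `tmk` that appear throughout this tutorial:

```python
# PyPropel for dataset handling; TMKit for transmembrane protein analysis.
import pypropel as pp
import tmkit as tmk
```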
Additionally, we will also use NumPy and Pandas for quickly processing data.
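```python
# NumPy for array manipulation and Pandas for tabular data handling.
import numpy as np
import pandas as pd
```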
1. Pre-processing¶
1.1 Downloading a PDBTM database¶
Then, we can download an up-to-date list of transmembrane proteins from the PDBTM database (version: 12.06.2024). Here, we also attach two other lists (versions: 06.30.2023 and 10.02.2023), which were previously downloaded. The latest version of the transmembrane protein list can be found here.
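A sketch of fetching the lists. The function name `download_pdbtm_lst` and its parameters are assumptions for illustration; check the PyPropel documentation for the exact API.

```python
# Fetch the latest list and keep the two previously downloaded ones alongside.
versions = ['12.06.2024', '10.02.2023', '06.30.2023']
for v in versions:
    pp.dataset.download_pdbtm_lst(  # hypothetical function name
        version=v,
        sv_fp='data/',              # assumed parameter: output folder
    )
```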
Check whether the latest list of proteins contains entries duplicated due to technical errors. Then, save the deduplicated list of proteins (using version 12.06.2024 as an example).
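A sketch of the duplication check with pandas, assuming each list is a plain-text file with one entry per line (file names are illustrative):

```python
# Read the 12.06.2024 list, report duplicates, and save a deduplicated copy.
df = pd.read_csv('data/pdbtm_alpha_12.06.2024.txt', header=None, names=['prot'])
print(f"{df['prot'].duplicated().sum()} duplicated entries found")
df.drop_duplicates(subset=['prot']).to_csv(
    'data/pdbtm_alpha_12.06.2024_dedup.txt', header=False, index=False,
)
```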
We then read the two latest lists to identify the updated proteins.
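A sketch, under the same plain-text assumption:

```python
# Read the two most recent lists so they can be compared.
df_new = pd.read_csv('data/pdbtm_alpha_12.06.2024.txt', header=None, names=['prot'])
df_old = pd.read_csv('data/pdbtm_alpha_10.02.2023.txt', header=None, names=['prot'])
```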
Find the portion of proteins updated in the latest version of PDBTM. Then, store it as `prot_series`.
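A sketch of the set difference:

```python
# Proteins present in the latest list but absent from the previous one.
updated = sorted(set(df_new['prot']) - set(df_old['prot']))
prot_series = pd.Series(updated)
print(prot_series)
```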
Retrieve PDBTM structures and store them in `data/pdbtm/cplx/`. As the protein structures are coordinate-transformed from the RCSB PDB files, it is necessary to set `kind` to `tr`.
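A sketch; the function name is hypothetical, while `kind='tr'` is the setting described above.

```python
# Download the transformed (tr) PDBTM structures for the updated proteins.
pp.dataset.download_pdbtm_pdb(  # hypothetical function name
    prot_series=prot_series,
    kind='tr',
    sv_fp='data/pdbtm/cplx/',
)
```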
Retrieve PDBTM XML files (useful for obtaining topological information) and store them in `data/xml/`.
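A sketch with a hypothetical function name:

```python
# Download the PDBTM XML files carrying the topology annotations.
pp.dataset.download_pdbtm_xml(  # hypothetical function name
    prot_series=prot_series,
    sv_fp='data/xml/',
)
```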
1.2 Cleaning structural data¶
We can now move on to extracting the structure of each chain from a protein complex structure stored in PDB format.
There are two ways to get the list of chains with respect to their proteins.
Option 1. Fetch all structures gathered cumulatively from the updated lists. In each updated list, some proteins from previous lists disappear even though they have already been downloaded; we need to include those as well.
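A pandas sketch of the cumulative union (file names are illustrative):

```python
# Union of all dated lists: proteins that dropped out of later lists have
# still been downloaded, so they are kept.
dfs = [
    pd.read_csv(f'data/pdbtm_alpha_{v}.txt', header=None, names=['prot'])
    for v in ['06.30.2023', '10.02.2023', '12.06.2024']
]
df = pd.concat(dfs).drop_duplicates(subset=['prot']).reset_index(drop=True)
```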
Then, we can use `pp.str.chains` to get all chains from the complex structure of each protein. After that, we save the list.
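A sketch; `pp.str.chains` is named in this tutorial, but its parameters are assumed here.

```python
# Collect all chains of each complex and save them with the mandatory
# 'prot' and 'chain' headers.
records = []
for prot in df['prot']:
    for chain in pp.str.chains(
        prot_name=prot,             # assumed parameter names
        pdb_fp='data/pdbtm/cplx/',
    ):
        records.append([prot, chain])
df = pd.DataFrame(records, columns=['prot', 'chain'])
df.to_csv('data/prot_n_chains.txt', sep='\t', index=False)
```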
Option 2. Get chains from all previously gathered lists of proteins by taking the union of all of them. We can do it like this.
Warning
Make sure the dataframe (`df`) always contains the two headers `prot` and `chain` for indicating proteins and chains, because other functions may require this information.
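A pandas sketch of the union (file names are illustrative):

```python
# Union of previously gathered chain lists; keep the 'prot'/'chain' headers.
files = [
    'data/prot_n_chains_06.30.2023.txt',
    'data/prot_n_chains_10.02.2023.txt',
    'data/prot_n_chains_12.06.2024.txt',
]
df = pd.concat(
    [pd.read_csv(f, sep='\t') for f in files]
).drop_duplicates(subset=['prot', 'chain']).reset_index(drop=True)
```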
Then, use the `pp.dataset.download_pack` function to extract the structures of all chains from the structures of all protein complexes. To this end, we need to specify where the complexes are located (e.g. `data/pdbtm/cplx/`) and where the structures of all chains are to be stored (e.g. `data/pdb/`).
Note
Please note that `pp.dataset.download_pack` will also refine the structures of all chains (see this page), including reformatting the structures and removing `HETATM` records. It will also report whether the sequence of a protein chain stored in the XML file is the same as in the PDB file, and it will extract the sequence and store it in FASTA format.
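A sketch; `pp.dataset.download_pack` is the function named above, but its parameter names are assumptions.

```python
# Extract and refine every chain structure from the complex structures.
pp.dataset.download_pack(
    df=df,                          # dataframe with 'prot' and 'chain' headers
    pdb_cplx_fp='data/pdbtm/cplx/', # where the complexes are located
    sv_fp='data/pdb/',              # where the chain structures go
    xml_fp='data/xml/',             # used for the sequence consistency check
    fasta_fp='data/fasta/',         # where the extracted sequences go
)
```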
You will then see the generated files in the corresponding folders.
1.3 Quality control of data¶
After the files above are obtained, we are all set to work on quality control (QC) of the sequence, topological, and structural data.
We can use `tmk.qc.integrate` or `tmk.qc.obtain_single` to obtain the metrics that we can use to remove or retain protein chains for analysis.
For simplicity, we showcase the generation of each metric one at a time. We are interested in `nchain`, `rez`, `met`, `mthm`, and `seq`, corresponding to the number of chains in a protein complex, the resolution, the experimental method used to resolve the structure, the number of transmembrane helices, and the amino acid sequence. Apart from these, we can also request extra metrics via `bio_name`, `head`, and `desc`, giving the biological name of the protein, the header of the PDB file, and a general description of the protein structure.
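A sketch; `tmk.qc.obtain_single` is the function named above, with assumed parameter names.

```python
# Generate one QC metric at a time; 'metric' switches among the fields above.
for metric in ['nchain', 'rez', 'met', 'mthm', 'seq', 'bio_name', 'head', 'desc']:
    tmk.qc.obtain_single(
        df_prot=df,                     # 'prot' and 'chain' headers required
        pdb_cplx_fp='data/pdbtm/cplx/',
        xml_fp='data/xml/',
        metric=metric,
        sv_fp='data/qc/',
    )
```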
Then, you will see the generated metric files.
Then, we can run a separate coding procedure to integrate these metrics into a single file, using the following code.
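A sketch using `tmk.qc.integrate` (named above) with assumed parameters:

```python
# Integrate the per-metric files into a single table.
tmk.qc.integrate(
    df_prot=df,
    pdb_cplx_fp='data/pdbtm/cplx/',
    xml_fp='data/xml/',
    metrics=['nchain', 'rez', 'met', 'mthm', 'seq'],
    sv_fp='data/qc/',
)
```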
1.4 Integration of QC metrics¶

You may notice a problem now: for the metrics `met` and `nchain`, some blanks remain unfilled. This is because we generate them at the protein complex level rather than the protein chain level, whereas the metric file lists proteins at the chain level. Thus, we need to fill in the two metrics for every chain in the table. We can do this as follows.
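A pandas sketch of the fill, assuming the integrated table is `data/qc/integrate.xlsx`:

```python
# 'met' and 'nchain' are defined per complex, so propagate each complex-level
# value to every chain of that complex.
df_qc = pd.read_excel('data/qc/integrate.xlsx')
for metric in ['met', 'nchain']:
    df_qc[metric] = df_qc.groupby('prot')[metric].transform(
        lambda s: s.ffill().bfill()
    )
df_qc.to_excel('data/qc/integrate1.xlsx', index=False)
```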
This results in a file called `integrate1.xlsx`, in which the blanks for the two metrics have all been filled.
1.5 Generating datasets¶
Next, we can screen protein chains by setting restrictions, thus generating a dataset.
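A pandas sketch of the screening; the thresholds here are illustrative, not the tutorial's exact criteria.

```python
# Keep chains that satisfy the restrictions on resolution, method, and
# number of transmembrane helices.
df_screened = df_qc[
    (df_qc['rez'] <= 3.5)                       # resolution cutoff (assumed)
    & (df_qc['met'] == 'x-ray diffraction')     # method (assumed)
    & (df_qc['mthm'] >= 1)                      # >= 1 TM helix (assumed)
].reset_index(drop=True)
df_screened.to_excel('data/qc/dataset.xlsx', index=False)
```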
This results in a dataset of 7,148 protein chains screened for use.
For all protein complexes with at least two protein chains, we need to calculate whether the chains within a complex are spatially and physically in interaction. This information will be treated as the label of each protein site. We reckon two protein chains to be in interaction if their distance is within 5.5 angstroms (Å). We can achieve this in two ways.
- Batch-process them in the cloud.
- (Recommended) Alternatively, you can generate all of them on your own computer with the code shown below. This might take a long time to complete; you can simply extract a small portion of these proteins for training purposes. Then, once you are familiar with the workflow, you can perform all of these operations in the cloud.
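A sketch of the local run; `tmk.ppi.dist_all` is a hypothetical stand-in for the per-complex inter-chain distance routine.

```python
# Compute inter-chain distances complex by complex on the local machine.
for prot in df_screened['prot'].unique():
    tmk.ppi.dist_all(               # hypothetical function name
        prot_name=prot,
        pdb_fp='data/pdbtm/cplx/',
        sv_fp='data/dist/',
        cutoff=5.5,                 # interaction threshold in angstroms
    )
```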
This will result in a number of text files, one per protein complex.
Then, we can merge them all into a `merge.txt` file for easy access to the information.
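A plain-Python sketch of the merge:

```python
import glob

# Concatenate the per-complex distance files into one merge.txt.
with open('data/merge.txt', 'w') as out:
    for fn in sorted(glob.glob('data/dist/*.txt')):
        with open(fn) as f:
            out.write(f.read())
```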
After this, we can take the overlap between the proteins screened by the QC analysis and the proteins that are in interaction, as follows, and save it as `dataset_post_ccheck.xlsx`.
Now, you have the dataset for machine learning!
2. Feature extraction¶
We can start feature extraction for these protein chains. For better illustration, we only take two example protein chains, namely chain `A` of protein `1xqf` (for training a machine learning model) and chain `L` of protein `1aij` (for testing the model).
First, we can generate the label `y`.
Calculate the distance first.
Then, we label the residues according to these distances.
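A pandas sketch of the labelling; the `dist_1`/`dist_2` columns match the output below, while the file paths and separator are assumptions.

```python
# A residue is labelled 1 (in contact) if its closest inter-chain distance
# lies within the 5.5 angstrom cutoff, otherwise 0.
def label_contacts(dist_fpn, cutoff=5.5):
    df_dist = pd.read_csv(dist_fpn, sep='\t')
    df_dist['is_contact'] = (
        df_dist[['dist_1', 'dist_2']].min(axis=1) < cutoff
    ).astype(int)
    return df_dist

df_train_lbl = label_contacts('data/dist/1xqfA.txt')  # illustrative paths
df_test_lbl = label_contacts('data/dist/1aijL.txt')
```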
`is_contact` shows the label of each amino acid site.
Output
19/12/2024 15:58:15 logger: ================>Labeling data...
19/12/2024 15:58:15 logger: ================>Time to label distances 1xqf A: 0.006977558135986328s.
fasta_id aa pdb_id dist_1 dist_2 is_contact
0 1 A 3 11.391303 11.278594 0
1 2 V 4 7.974939 9.416605 0
2 3 A 5 7.700507 6.696266 0
3 4 D 6 3.313506 5.633641 1
4 5 K 7 6.614738 3.045629 1
.. ... .. ... ... ... ...
357 358 G 382 35.826157 13.476321 0
358 359 L 383 32.409570 14.432422 0
359 360 R 384 32.183815 14.324697 0
360 361 V 385 28.519402 10.922935 0
361 362 P 386 32.594830 10.024420 0
[362 rows x 6 columns]
19/12/2024 15:58:15 logger: ================>Labeling data...
19/12/2024 15:58:15 logger: ================>Time to label distances 1aij L: 0.0030066967010498047s.
fasta_id aa pdb_id dist_1 dist_2 is_contact
0 1 A 1 3.359507 3.820152 1
1 2 L 2 4.832898 2.840869 1
2 3 L 3 3.621853 3.450767 1
3 4 S 4 6.046180 2.841488 1
4 5 F 5 3.438960 3.688982 1
.. ... .. ... ... ... ...
276 277 G 277 2.910155 43.177975 1
277 278 G 278 3.036153 40.448544 1
278 279 I 279 2.979592 36.381110 1
279 280 N 280 2.980944 38.922234 1
280 281 G 281 3.484761 42.785560 1
[281 rows x 6 columns]
To generate `X`, we extract features for the two proteins. For demonstration purposes, we only involve amino acid properties in this process. There are 23 properties available to use, including those sketched below.
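An illustrative subset of the property names; these labels are placeholders, and the full set of 23 is listed in the PyPropel documentation.

```python
# Amino acid properties used as per-residue features (placeholder names).
aa_props = [
    'positive', 'negative', 'charged', 'polar', 'hydrophobicity',
    'aliphatic', 'aromatic', 'volume', 'weight', 'pka',
    # ... the remaining properties
]
```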
Then, we mainly use the `TMKit` tool to generate these features.
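A hypothetical sketch of the feature generation; `tmk.seq.read_from_fasta` and `tmk.feature.aa_property` are stand-ins for however TMKit exposes sequences and property scales.

```python
# Build an (n_residues x n_properties) feature matrix per chain.
def featurize(prot, chain, props):
    seq = tmk.seq.read_from_fasta(          # assumed signature
        fasta_fpn=f'data/fasta/{prot}{chain}.fasta'
    )
    return np.array([
        [tmk.feature.aa_property(aa, p) for p in props]  # stand-in call
        for aa in seq
    ])

X_train = featurize('1xqf', 'A', aa_props)
X_test = featurize('1aij', 'L', aa_props)
```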
Let's now check `X` and `y`.
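A sketch, taking the labels from the `is_contact` columns generated earlier:

```python
y_train = df_train_lbl['is_contact'].values   # labels for 1xqf chain A
y_test = df_test_lbl['is_contact'].values     # labels for 1aij chain L

print(X_train)
print(X_test)
print(y_train)
print(y_test)
```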
Output
[[1. 1. 1. ... 0.16766467 0.39506173 0. ]
[1. 1. 1. ... 0.48502994 0.12345679 0. ]
[1. 1. 1. ... 0.16766467 0.39506173 0. ]
...
[0. 1. 0.33333333 ... 0.7245509 0.69135802 0.23636364]
[1. 1. 1. ... 0.48502994 0.12345679 0. ]
[1. 1. 1. ... 0.17664671 0.38271605 0.14181818]]
[[1. 1. 1. ... 0.16766467 0.39506173 0. ]
[1. 1. 1. ... 0.64670659 0. 0. ]
[1. 1. 1. ... 0.64670659 0. 0. ]
...
[1. 1. 1. ... 0.64670659 0.03703704 0. ]
[1. 1. 1. ... 0.31736527 0.82716049 0.48363636]
[1. 1. 1. ... 0. 0.50617284 0.26909091]]
[0 0 0 1 1 1 1 1 1 1 0 ... 0 0 0 0 0 0 0 0]
[1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1]
3. Machine learning with random forest¶
Then, we can train a machine learning model using the random forest algorithm, drawing on the `sklearn` library.
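A sketch with scikit-learn; the hyperparameters are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train on 1xqf chain A and test on 1aij chain L.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # kept for the evaluation below
print('Accuracy:', accuracy_score(y_test, y_pred))
```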
Output
Accuracy: 0.5765124555160143
4. Machine learning evaluation metrics¶
Before post-processing and post-analysing the machine learning results, we need to save the prediction probabilities in the `tma300` format, with the first column holding amino acid positions, the second amino acid symbols, and the third prediction probabilities.
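A pandas sketch of the three-column layout; the output path is illustrative.

```python
# Column 1: position, column 2: residue, column 3: predicted probability.
pd.DataFrame({
    'pos': df_test_lbl['fasta_id'],
    'aa': df_test_lbl['aa'],
    'prob': y_prob,
}).to_csv('data/isite/1aijL.tma300', sep=' ', header=False, index=False)
```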
Then, we batch-generate comprehensive sets of evaluation metrics.
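The tutorial batch-generates the metrics with PyPropel; as that API is not shown here, below is a manual sketch of a few of the same metrics with scikit-learn.

```python
from sklearn import metrics

print('AUC-ROC:', metrics.roc_auc_score(y_test, y_prob))
print('AUC-PR :', metrics.average_precision_score(y_test, y_prob))
print('MCC    :', metrics.matthews_corrcoef(y_test, y_pred))
```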
Warning
The evaluation is made within the transmembrane regions predicted by Phobius. We therefore need to place this file under `topo_fp='data/isite/'`. The file can be downloaded from the PyPropel repository through the release tab.
Then, we will see a number of evaluation metric files there.
The metrics are tabulated in JSON format.
5. Visualisation of evaluation metrics¶
Finally, we can plot the evaluation metrics using PyPropel, for example the ROC curves.