Pipeline 2: Drug-Target Interaction Prediction
In this review, we supplied a deep learning method, Drutai, for sequence-based drug discovery. It uses a dataset of drug-target interactions (DTIs) downloaded from DrugBank (version 5.1.8). All proteins are human. Here, we show how PyPropel can be used together with other tools to quickly train a machine learning model for predicting DTIs specific to human transmembrane proteins.
This detailed, hands-on tutorial walks you through the process. First, we download the human transmembrane proteins from UniProt. Second, we download the DTIs from DrugBank. The data can be fetched from the release tab of the PyPropel repository, where the training and test datasets of Drutai can also be found.
Abstract
The key idea is to take the subset of Drutai's training and test proteins that overlaps with the transmembrane proteome and use it to train a new predictor. PyPropel supports this step by step, because the whole workflow can be assembled from a few of its APIs.
You will specifically see how PyPropel and TMKit can batch pre-process the structures of 10,971 protein complex files (including structures of 58,060 protein chains). Finally, we feed the generated data into scikit-learn for machine learning.
The tutorial consists of four steps:
Features
- Data preparation
- Generation of training and test datasets
- Feature extraction
- Machine learning with random forest
Let's first import both of the libraries.
We will also use NumPy and pandas for quick data processing.
1. Data preparation
First, we need to split the transmembrane proteome into separate FASTA files to obtain the sequences and identifiers of the proteins.
Similarly, we split the entire protein set (approved.fasta) into individual protein files.
Info
Because UniProt and DrugBank format their FASTA headers differently, we need to specify mode='uniprot' or mode='drugbank' so that PyPropel parses each header format correctly.
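To make the difference between the two header styles concrete, here is a minimal pure-Python sketch that splits a multi-record FASTA string into one file per protein. It is not PyPropel's implementation. The helper names `extract_id` and `split_fasta`, the example accessions, and the header layouts (`>sp|ACCESSION|ENTRY_NAME ...` for UniProt, `>drugbank_target|ACCESSION ...` for DrugBank) are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def extract_id(header: str, mode: str) -> str:
    """Return the protein accession from a FASTA header line (without '>')."""
    fields = header.split("|")
    if mode == "uniprot":
        # Assumed layout: sp|ACCESSION|ENTRY_NAME description
        return fields[1]
    if mode == "drugbank":
        # Assumed layout: drugbank_target|ACCESSION description (drug IDs)
        return fields[1].split()[0]
    raise ValueError(f"unknown mode: {mode!r}")


def split_fasta(text: str, out_dir: Path, mode: str) -> dict:
    """Write one FASTA file per record and return {accession: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = extract_id(line[1:], mode)
            records[current] = ""
        elif current is not None:
            records[current] += line  # join wrapped sequence lines
    out_dir.mkdir(parents=True, exist_ok=True)
    for acc, seq in records.items():
        (out_dir / f"{acc}.fasta").write_text(f">{acc}\n{seq}\n")
    return records


# Made-up records in the two assumed header styles.
uniprot_text = (
    ">sp|P11111|AAA_HUMAN Example protein\nMKTLLV\nAAGG\n"
    ">sp|P22222|BBB_HUMAN Another protein\nMSSPQ\n"
)
drugbank_text = ">drugbank_target|P11111 Example protein (DBX0001)\nMKTLLVAAGG\n"

with tempfile.TemporaryDirectory() as tmp:
    uniprot_records = split_fasta(uniprot_text, Path(tmp) / "uniprot", mode="uniprot")
    drugbank_records = split_fasta(drugbank_text, Path(tmp) / "drugbank", mode="drugbank")
    written = sorted(p.name for p in (Path(tmp) / "uniprot").iterdir())
```

Naming each file after the accession means the list of identifiers can later be recovered from the file names alone.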
Warning
Because the training set of proteins in Drutai includes proteins that DrugBank categorises as experimental, we also need to extract all proteins from expt.fasta.
In this way, we can obtain the identifiers of all proteins in both the UniProt transmembrane proteome and the DrugBank sets.
We can now count how many drug targets overlap with the transmembrane proteome.
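Counting the overlap reduces to a set intersection over protein identifiers. The sketch below uses made-up identifier lists; in the tutorial they would come from the split FASTA files, and the variable names are hypothetical.

```python
# Hypothetical identifier lists standing in for the real ones.
tm_ids = ["P11111", "P22222", "P33333"]        # UniProt transmembrane proteome
approved_ids = ["P22222", "P44444"]            # targets from approved.fasta
expt_ids = ["P33333", "P22222", "P55555"]      # targets from expt.fasta

# Union of both DrugBank sets, then intersect with the transmembrane proteome.
drug_target_ids = set(approved_ids) | set(expt_ids)
tm_targets = set(tm_ids) & drug_target_ids

print(f"{len(tm_targets)} of {len(drug_target_ids)} drug targets are transmembrane proteins")
```

Using sets also removes duplicate identifiers, which matters when the same protein appears in both DrugBank files.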
Download the training and test sets of Drutai from PyPropel's GitHub repository. They are:
- train_pl.txt.txt
- train_nl.txt
- positive_test.txt
- negative_test.txt
For example, DTIs are arranged in the train_pl.txt.txt file as below.

2. Generation of training and test datasets
We can build the training set of proteins.
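One way to picture this step is as a filter: keep only the DTIs whose protein belongs to the transmembrane overlap, and attach a label to each pair. The sketch below works on in-memory tuples rather than the actual files. The function name `build_dataset`, the (protein, drug) column order, and the 1/0 labels for pairs from the positive and negative files are assumptions for illustration.

```python
def build_dataset(positive_pairs, negative_pairs, allowed_proteins):
    """Keep pairs whose protein is allowed and attach a 1/0 interaction label."""
    rows = []
    for label, pairs in ((1, positive_pairs), (0, negative_pairs)):
        for protein, drug in pairs:
            if protein in allowed_proteins:
                rows.append((protein, drug, label))
    return rows


# Made-up identifiers; the real pairs would be read from the Drutai files.
tm_targets = {"P22222", "P33333"}
positive_pairs = [("P22222", "DBX0001"), ("P44444", "DBX0002")]
negative_pairs = [("P33333", "DBX0003"), ("P55555", "DBX0001")]

train = build_dataset(positive_pairs, negative_pairs, tm_targets)
```

The same function can be reused for the test set by passing the pairs from the positive and negative test files.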
We can build the test set of proteins.
It is important to shuffle the training dataset so that the samples are not ordered by label or source file when the data are later split for training and validation.
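A seeded shuffle keeps the run reproducible while still mixing the rows. The sketch below uses only the standard library and a made-up list of (protein, drug, label) rows; the seed value is arbitrary.

```python
import random

# Made-up rows: the first five are positives, the last five negatives.
train = [("P%05d" % i, "DBX%04d" % i, 1 if i < 5 else 0) for i in range(10)]

rng = random.Random(0)   # fixed seed so the order is reproducible
shuffled = train[:]      # copy, so the original order is preserved
rng.shuffle(shuffled)    # each row moves as a unit, so labels stay attached
```

If the data are held in a pandas DataFrame instead, `df.sample(frac=1, random_state=0)` has a similar effect.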
3. Feature extraction
Let's now assemble a set of protein and chemical features into a feature matrix X for machine learning.
To obtain chemical features, we use RDKit (please ensure that it is installed in your conda environment before use). The all.sdf file contains the structures of all molecules downloaded from DrugBank (version 5.1.8). We store them in a dictionary, mols, so that each molecule can be looked up easily whenever it is needed.
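As a rough sketch of this step, the code below writes a two-molecule stand-in for all.sdf, reads it back with RDKit, and indexes the molecules by identifier. The drug IDs are made up, and the property name `DRUGBANK_ID` is an assumption about how DrugBank labels its SDF records; check the actual file before relying on it.

```python
import os
import tempfile

from rdkit import Chem
from rdkit.Chem import Descriptors

# Build a tiny stand-in for all.sdf; the IDs are made up for illustration.
sdf_path = os.path.join(tempfile.mkdtemp(), "toy.sdf")
writer = Chem.SDWriter(sdf_path)
for drug_id, smiles in [("DBX0001", "CCO"), ("DBX0002", "CC(=O)Oc1ccccc1C(=O)O")]:
    mol = Chem.MolFromSmiles(smiles)
    mol.SetProp("DRUGBANK_ID", drug_id)  # assumed property name
    writer.write(mol)
writer.close()

# Index molecules by ID so each DTI can look up its drug directly.
mols = {}
for mol in Chem.SDMolSupplier(sdf_path):
    if mol is None:  # RDKit yields None for records it cannot parse
        continue
    mols[mol.GetProp("DRUGBANK_ID")] = mol

# Two descriptors that can serve as chemical features.
logp = Descriptors.MolLogP(mols["DBX0001"])  # Crippen LogP estimate
mr = Descriptors.MolMR(mols["DBX0001"])      # Crippen molar refractivity estimate
```

Skipping `None` entries matters for large SDF files, where a few records may fail to parse.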
We use a for loop to process one protein-drug pair at a time.
Info
The protein features include AAC, DAC, TAC, CKSNAP, and AVEANF. Their usage is shown in the PyPropel documentation.
We first use only the protein features AAC, DAC, and AVEANF to get an initial impression of the prediction performance.
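To show what two of these features measure, here is a pure-Python sketch of amino acid composition (AAC, 20 values) and dipeptide composition (DAC, 400 values) using their standard definitions. It does not call PyPropel, whose residue ordering and normalisation may differ, and AVEANF is left out because its definition is not given here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq: str) -> list:
    """Amino acid composition: fraction of each of the 20 residues."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dac(seq: str) -> list:
    """Dipeptide composition: fraction of each of the 400 adjacent pairs.

    Assumes len(seq) >= 2.
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    for p in pairs:
        if p in counts:  # skip pairs containing non-standard residues
            counts[p] += 1
    return [c / len(pairs) for c in counts.values()]


seq = "MKTLLVAAGG"                 # made-up example sequence
features = aac(seq) + dac(seq)     # 20 + 400 = 420 values per protein
```

Concatenating per-protein vectors like this, followed by any per-drug descriptors, gives one row of the feature matrix for each protein-drug pair.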
In addition, features generated by other tools, for example PyProtein, can be added to the feature matrix in the same way.
The training set contains 9,426 DTIs in total. We train on the first 8,000 DTIs and hold out the remaining 1,426 to validate the machine learning performance.
Output
[[ 7.3016003e-02 2.8571000e-02 2.8571000e-02 ... 5.0414350e-02
3.5518999e+00 1.3252400e+02]
[ 4.2826999e-02 2.5696000e-02 5.3532999e-02 ... 4.4115603e-02
-1.4215300e+00 8.4962303e+01]
[ 5.9880000e-02 5.9880000e-03 5.3892002e-02 ... 3.6085926e-02
-2.1226001e+00 2.7285200e+01]
...
[ 1.5277800e-01 2.1825001e-02 3.7698001e-02 ... 3.0565832e-02
-3.4000000e-01 1.8316799e+01]
[ 9.7156003e-02 3.3174999e-02 3.7914999e-02 ... 3.1797495e-02
3.3736999e+00 1.1967500e+02]
[ 8.2608998e-02 3.2609001e-02 2.1739000e-02 ... 3.8699895e-02
3.8090000e+00 1.1576270e+02]]
(9426, 442)
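The hold-out split described above is plain slicing on the shuffled rows. The sketch below uses toy lists with the same row count as the training set; the function name `holdout_split` is hypothetical, and the same slices work unchanged on NumPy arrays.

```python
def holdout_split(X, y, n_train):
    """Use the first n_train rows for training and the rest for validation."""
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


# Toy stand-ins with the same number of rows as the tutorial's training set.
X = [[float(i)] for i in range(9426)]
y = [i % 2 for i in range(9426)]

X_train, y_train, X_val, y_val = holdout_split(X, y, 8000)
print(len(X_train), len(X_val))
```

Because the split is positional, it only gives a fair validation set if the rows were shuffled beforehand.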
4. Machine learning with random forest
4.1 Training a random forest model
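A minimal sketch of this step with scikit-learn is shown below on synthetic data. The data, `n_estimators=100`, and `random_state=0` are assumptions chosen for a reproducible illustration, not the settings used to obtain the results reported in this tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10)).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First rows for training, remaining rows for validation.
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_val)  # one predicted label per validation row
```

Fixing `random_state` makes the fitted forest, and therefore the reported accuracy, repeatable between runs.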
4.2 Evaluation
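Accuracy is the fraction of validation pairs whose predicted label matches the true label. The sketch below computes it by hand on made-up labels; scikit-learn's `accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: three of the five predictions are correct.
acc = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(f"Accuracy: {acc}")
```

With 1,426 validation pairs, each additional correct prediction changes the accuracy by about 0.0007.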
If we use protein features only, the accuracy is
Output
Aaccuracy: 0.7349228611500701
If we further add two chemical features of the drug molecules, LogP and molar refractivity, the accuracy becomes
Output
Aaccuracy: 0.7496493688639552
Adding these two chemical features raises the validation accuracy from about 0.735 to about 0.750.