Single metric

Single metric#

TMkit can evaluate qualities of proteins using a QC metric. We currently compile 7 metrics shown as follows.

Attribute

Description

desc

biological description

met

determination method

bio_name

biological name

head

headline notation

mthm

number of helices

rez

resolution

seq

sequence information

Reminder of data

Please make sure that the build-in example dataset has been downloaded before you walk through the tutorial.

Example usage#

First, let’s define 5 transmembrane proteins to be used, protein 1xqf chain A, protein 1eq8 chain A, protein 6e3y chain E, protein 3pux chain G and protein 3udc chain A, and put them in a Pandas dataframe as follows.

import pandas as pd

prots = [['1xqf', 'A'], ['1eq8', 'A'], ['6e3y', 'E'], ['3pux', 'G'], ['3udc', 'A'], ['3rko', 'A']]
df_prot = pd.DataFrame(prots, columns=['prot', 'chain'])
df_prot = df_prot.rename(columns={
    0: 'prot',
    1: 'chain',
})
print(df_prot)

We can download their PDB structures from PDBTM and save them in ./data/pdb/pdbtm/.

import tmkit as tmk

tmk.seq.retrieve_pdb_from_pdbtm(
    prot_series=df_prot['prot'],
    sv_fp='./data/pdb/pdbtm/',
)
===>No.1 protein name: 1xqf
======>successfully downloaded!
===>No.2 protein name: 1eq8
======>successfully downloaded!
===>No.3 protein name: 6e3y
======>successfully downloaded!
===>No.4 protein name: 3pux
======>successfully downloaded!
===>No.5 protein name: 3udc
======>successfully downloaded!
===>No.6 protein name: 3rko
======>successfully downloaded!

We can then download their XML files from PDBTM and save them in ./data/xml/.

import tmkit as tmk

tmk.seq.retrieve_xml_from_pdbtm(
    prot_series=df_prot['prot'],
    sv_fp='./data/xml/',
)
===>No.1 protein name: 1xqf
======>successfully downloaded!
===>No.2 protein name: 1eq8
======>successfully downloaded!
===>No.3 protein name: 6e3y
======>successfully downloaded!
===>No.4 protein name: 3pux
======>successfully downloaded!
===>No.5 protein name: 3udc
======>successfully downloaded!
===>No.6 protein name: 3rko
======>successfully downloaded!

Finally, we can just use the following one command generate the results using one of the above-mentioned 7 metrics. Each time you can just alter what’s put in parameter metric.

import tmkit as tmk

df = tmk.qc.obtain_single(
    df_prot=df_prot,
    pdb_cplx_fp='./data/pdb/pdbtm/',
    fasta_fp='./data/fasta/',
    xml_fp='./data/xml/',
    sv_fp='./data/qc/',
    metric='desc', # 'desc' 'met', 'bio_name', 'head', 'mthm', 'seq'
)
print(df)

Attributes#

Attribute

Description

df_prot

Pandas dataframe storing protein names and chain names

pdb_cplx_fp

path where a protein complex file from PDBTM is placed

fasta_fp

path where a protein Fasta file is placed

xml_fp

path where a protein XML file from PDBTM is placed

sv_fp

path to save files

metric

a QC metric: rez, met, bio_name, head, desc, mthm, seq

See also

Please see here for better understanding the file-naming system.

Output#

=========>protein 1xqf chain A with rez 1.8
=========>protein 1eq8 chain A with rez nan
=========>protein 6e3y chain E with rez 3.3
=========>protein 3pux chain G with rez 2.3
=========>protein 3udc chain A with rez 3.35
=========>protein 3rko chain A with rez 3.0
   prot chain   rez
0  1xqf     A  1.80
1  1eq8     A   NaN
2  6e3y     E  3.30
3  3pux     G  2.30
4  3udc     A  3.35
5  3rko     A  3.00
=========>protein 1xqf chain A with met x-ray diffraction
=========>protein 1eq8 chain A with met unknown
=========>protein 6e3y chain E with met x-ray diffraction
=========>protein 3pux chain G with met x-ray diffraction
=========>protein 3udc chain A with met x-ray diffraction
=========>protein 3rko chain A with met x-ray diffraction
   prot chain                met
0  1xqf     A  x-ray diffraction
1  1eq8     A            unknown
2  6e3y     E  x-ray diffraction
3  3pux     G  x-ray diffraction
4  3udc     A  x-ray diffraction
5  3rko     A  x-ray diffraction
=========>protein 1xqf chain A with bio_name the mechanism of ammonia transport based on the crystal structure of amtb of e. coli.
=========>protein 1eq8 chain A with bio_name three-dimensional structure of the pentameric helical bundle of the acetylcholine receptor m2 transmembrane segment
=========>protein 6e3y chain E with bio_name cryo-em structure of the active, gs-protein complexed, human cgrp receptor
=========>protein 3pux chain G with bio_name crystal structure of an outward-facing mbp-maltose transporter complex bound to adp-bef3
=========>protein 3udc chain A with bio_name crystal structure of a membrane protein
=========>protein 3rko chain A with bio_name crystal structure of the membrane domain of respiratory complex i from e. coli at 3.0 angstrom resolution
   prot chain                                           bio_name
0  1xqf     A  the mechanism of ammonia transport based on th...
1  1eq8     A  three-dimensional structure of the pentameric ...
2  6e3y     E  cryo-em structure of the active, gs-protein co...
3  3pux     G  crystal structure of an outward-facing mbp-mal...
4  3udc     A            crystal structure of a membrane protein
5  3rko     A  crystal structure of the membrane domain of re...
=========>protein 1xqf chain A with head transport protein
=========>protein 1eq8 chain A with head signaling protein
=========>protein 6e3y chain E with head signaling protein
=========>protein 3pux chain G with head hydrolase/transport protein
=========>protein 3udc chain A with head membrane protein
=========>protein 3rko chain A with head oxidoreductase
   prot chain                         head
0  1xqf     A            transport protein
1  1eq8     A            signaling protein
2  6e3y     E            signaling protein
3  3pux     G  hydrolase/transport protein
4  3udc     A             membrane protein
5  3rko     A               oxidoreductase
=========>protein 1xqf chain A with desc TRANSPORT PROTEIN
=========>protein 1eq8 chain A with desc SIGNALING PROTEIN
=========>protein 6e3y chain E with desc SIGNALING PROTEIN
=========>protein 3pux chain G with desc HYDROLASE/TRANSPORT PROTEIN
=========>protein 3udc chain A with desc MEMBRANE PROTEIN
=========>protein 3rko chain A with desc OXIDOREDUCTASE
   prot chain                         desc
0  1xqf     A            TRANSPORT PROTEIN
1  1eq8     A            SIGNALING PROTEIN
2  6e3y     E            SIGNALING PROTEIN
3  3pux     G  HYDROLASE/TRANSPORT PROTEIN
4  3udc     A             MEMBRANE PROTEIN
5  3rko     A               OXIDOREDUCTASE
=========>protein 1xqf chain A with mthm 11.0
=========>protein 1eq8 chain A with mthm 1.0
=========>protein 6e3y chain E with mthm 1.0
=========>protein 3pux chain G with mthm 6.0
=========>protein 3udc chain A with mthm 3.0
=========>protein 3rko chain A with mthm 3.0
   prot chain  mthm
0  1xqf     A  11.0
1  1eq8     A   1.0
2  6e3y     E   1.0
3  3pux     G   6.0
4  3udc     A   3.0
5  3rko     A   3.0
=========>File failed: 1xqf A
=========>File failed: 1eq8 A
=========>File failed: 6e3y E
=========>File failed: 3pux G
=========>File failed: 3udc A
=========>File failed: 3rko A
   prot chain                                             seq_aa  seq_len
0  1xqf     A  AVADKADNAFMMICTALVLFMTIPGIALFYGGLIRGKNVLSMLTQV...    362.0
1  1eq8     A                            EKMSTAISVLLAQAVFLLLTSQR     23.0
2  6e3y     E  EANYGALLRELCLTQFQVDMEAVGETLWCDWGRTIRSYRELADCTW...    115.0
3  3pux     G  AMVQPQKARLFITHLLLLLFIAAIMFPLLMVVAISLRQGNFATGSL...    293.0
4  3udc     A  YDIKAVKFLLDVLKILIIAFIGIKFADFLIYRFYKLYSKSKIQLPQ...    267.0
5  3rko     A  AFAIFLIVAIGLCCLMLVGGWFLGGRARARLRLSAKFYLVAMFFVI...     95.0

You can use the information to do QC. For example, proteins that are determined by X-ray methods are reserved. You should do this as follows. Then, protein 1eq8 chain A will be eliminated.

df[df['met'] == 'x-ray diffraction']