Fetch with customised metrics#
This tutorial will walk you through how to extract structure-related information from a Cath database using your customised metrics.
TMKit offers an interface, tmkit.cath
, to access the database.
Example usage#
Suppose you have two proteins of interest. They are 3udc
chain A
and 3rko
chain A
. Let’s define them in Python and render them in a Pandas dataframe as shown below.
import pandas as pd
prots = [['3udc', 'A'], ['3rko', 'A']]
df_prot = pd.DataFrame(prots, columns=['prot', 'chain'])
Then, we also need to read the downloaded Cath database (if you miss it, please see here) in the following way.
import tmkit as tmk
df = tmk.cath.read(
cath_fpn='./data/cath/cath-b-newest-all.txt',
groupby='version',
group='v4_2_0',
)
print(df)
If you are interested in funfam_number
, superfamily_id
and pdb_segments
of the proteins, you can tell TMKit by using a list like below.
metrics = [
'funfam_number',
'superfamily_id',
'pdb_segments',
]
Finally, we can extract all information about the two proteins using the tmk.cath.fftojson
function. The results will be saved in data/cath/processed.json
.
import tmkit as tmk
res = tmk.cath.fftojson(
df_prot=df_prot,
df_cath_domain=df,
sv_fpn='data/cath/processed.json',
targets=metrics,
)
print(res)
Attributes#
Attribute |
Description |
---|---|
|
version of the Cath database |
|
path to a file to save results |
|
path to the downloaded Cath database |
|
metric used to group data, e.g., version. There are 4 metrics in total, i.e., |
|
value of a metric. For example, if version is chosen, there are two, namely, |
|
Pandas dataframe of a series of proteins, consisting of protein names in the 1st column and chains in the 2nd column |
|
Pandas dataframe of the Cath database |
|
protein structure-related metrics, e.g., |
Output#
Finally, you will see the following output.
======>No.1 protein complex: 3udc
=========>domain id is: 3udcA01
=========>domain id is: 3udcA02
=========>domain id is: 3udcA03
=========>domain id is: 3udcB01
=========>domain id is: 3udcB02
=========>domain id is: 3udcB03
=========>domain id is: 3udcC01
=========>domain id is: 3udcC02
=========>domain id is: 3udcC03
=========>domain id is: 3udcD01
=========>domain id is: 3udcD02
=========>domain id is: 3udcD03
=========>domain id is: 3udcE01
=========>domain id is: 3udcE02
=========>domain id is: 3udcE03
=========>domain id is: 3udcF01
=========>domain id is: 3udcF02
=========>domain id is: 3udcF03
=========>domain id is: 3udcG01
=========>domain id is: 3udcG02
=========>domain id is: 3udcG03
======>No.2 protein complex: 3rko
=========>domain id is: 3rkoA00
=========>domain id is: 3rkoB02
=========>domain id is: 3rkoE00
=========>domain id is: 3rkoF01
=========>domain id is: 3rkoG00
=========>domain id is: 3rkoJ01
=========>domain id is: 3rkoK00
=========>domain id is: 3rkoL02
======>The file is saved.
{'3udc': {'A': {'01': {}, '02': {'funfam_number': 4564, 'superfamily_id': '2.30.30.60', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '177', 'chain_code': 'A', 'start': '128'}]}, '03': {'funfam_number': None, 'superfamily_id': '3.30.70.100', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '265', 'chain_code': 'A', 'start': '178'}]}}, 'B': {'01': {}, '02': {'funfam_number': 4564, 'superfamily_id': '2.30.30.60', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '177', 'chain_code': 'B', 'start': '128'}]}, '03': {'funfam_number': None, 'superfamily_id': '3.30.70.100', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '265', 'chain_code': 'B', 'start': '178'}]}}, 'C': {'01': {}, '02': {'funfam_number': 4564, 'superfamily_id': '2.30.30.60', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '177', 'chain_code': 'C', 'start': '128'}]}, '03': {'funfam_number': None, 'superfamily_id': '3.30.70.100', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '265', 'chain_code': 'C', 'start': '178'}]}}, 'D': {'01': {}, '02': {'funfam_number': 4564, 'superfamily_id': '2.30.30.60', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '177', 'chain_code': 'D', 'start': '128'}]}, '03': {'funfam_number': None, 'superfamily_id': '3.30.70.100', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '265', 'chain_code': 'D', 'start': '178'}]}}, 'E': {'01': {}, '02': {'funfam_number': 4564, 'superfamily_id': '2.30.30.60', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '177', 'chain_code': 'E', 'start': '128'}]}, '03': {'funfam_number': None, 'superfamily_id': '3.30.70.100', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '265', 'chain_code': 'E', 'start': '178'}]}}, 'F': {'01': {}, '02': {'funfam_number': 4564, 'superfamily_id': '2.30.30.60', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '177', 'chain_code': 'F', 'start': '128'}]}, '03': {'funfam_number': None, 'superfamily_id': '3.30.70.100', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '265', 'chain_code': 'F', 'start': '178'}]}}, 'G': {'01': {}, '02': {'funfam_number': 4564, 'superfamily_id': '2.30.30.60', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '177', 'chain_code': 'G', 'start': '128'}]}, '03': {'funfam_number': None, 'superfamily_id': '3.30.70.100', 'pdb_segments': [{'pdb_code': '3udc', 'stop': '265', 'chain_code': 'G', 'start': '178'}]}}}, '3rko': {'A': {'00': {'funfam_number': None, 'superfamily_id': '1.20.58.1610', 'pdb_segments': [{'pdb_code': '3rko', 'stop': '126', 'chain_code': 'A', 'start': '15'}]}}, 'B': {'02': {}}, 'E': {'00': {'funfam_number': None, 'superfamily_id': '1.20.58.1610', 'pdb_segments': [{'pdb_code': '3rko', 'stop': '126', 'chain_code': 'E', 'start': '15'}]}}, 'F': {'01': {'funfam_number': None, 'superfamily_id': '1.20.120.1200', 'pdb_segments': [{'pdb_code': '3rko', 'stop': '160', 'chain_code': 'F', 'start': '1'}]}}, 'G': {'00': {}}, 'J': {'01': {'funfam_number': None, 'superfamily_id': '1.20.120.1200', 'pdb_segments': [{'pdb_code': '3rko', 'stop': '160', 'chain_code': 'J', 'start': '1'}]}}, 'K': {'00': {}}, 'L': {'02': {}}}}