Individual collation#
In this tutorial, we introduce tmkit.collate, a module designed to collate a protein chain downloaded from PDBTM by comparing it with chains of the same protein from RCSB.
Tip
Collation is often necessary when working with PDBTM-derived protein chains, as PDBTM may transform or exclude certain chains present in the RCSB PDB structure file.
This tutorial provides a step-by-step example of how to detect and analyze these differences at the per-protein level.
Reminder of data
Please make sure that the built-in example dataset has been downloaded before you walk through the tutorial.
Example usage#
First, we need to prepare the RCSB and PDBTM structures of the proteins. If you have downloaded our example dataset, a few protein structures are already available in the following folders. Alternatively, you can obtain them through this tutorial.
pdb_rcsb_fp = 'data/pdb/collate/rcsb/'
pdb_pdbtm_fp = 'data/pdb/collate/pdbtm/'
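Before calling tmkit, it can help to confirm that both folders exist and contain structure files. The snippet below is a minimal sketch using only the Python standard library. The helper name `list_structure_files` and the `.pdb` extension are assumptions made for illustration; they are not part of tmkit.

```python
from pathlib import Path


def list_structure_files(folder, suffix=".pdb"):
    """Return sorted names of files ending with `suffix` in `folder`.

    Hypothetical helper (not part of tmkit). Returns an empty list
    if the folder does not exist.
    """
    path = Path(folder)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if p.is_file() and p.suffix == suffix)


# Folder paths taken from the tutorial; the ".pdb" suffix is an assumption.
for label, folder in [("RCSB", "data/pdb/collate/rcsb/"), ("PDBTM", "data/pdb/collate/pdbtm/")]:
    files = list_structure_files(folder)
    print(f"{label}: {len(files)} file(s) found in {folder}")
```

If either count is zero, the example dataset has probably not been downloaded to the expected location.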
Then, we can check how many chains a structure contains, and which ones, by using the following code.
import tmkit as tmk
# PDBTM
chains = tmk.collate.chain(
    prot_name='6cxh',
    pdb_fp=pdb_pdbtm_fp,
)
print(chains)
# output
======>protein has chains ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
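For intuition, chain identifiers can also be read directly from a PDB-format file: in the fixed-column PDB format, the chain ID occupies column 22 of ATOM and HETATM records. The sketch below illustrates that idea only. It is not how tmk.collate.chain is implemented; both helper functions are hypothetical and the demo records are made up.

```python
def make_atom_line(serial, atom_name, res_name, chain_id):
    """Build the leading fixed-width fields of an ATOM record (demo data only)."""
    # Columns: 1-6 record name, 7-11 serial, 13-16 atom name,
    # 18-20 residue name, 22 chain ID.
    return f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} {chain_id}"


def chains_from_pdb_lines(lines):
    """Collect chain IDs in order of first appearance from ATOM/HETATM records."""
    seen = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chain_id = line[21]  # column 22 is index 21
            if chain_id != " " and chain_id not in seen:
                seen.append(chain_id)
    return seen


# Made-up records, not taken from a real structure
example = [
    make_atom_line(1, " N", "MET", "A"),
    make_atom_line(2, " CA", "MET", "A"),
    make_atom_line(3, " N", "GLY", "B"),
]
print(chains_from_pdb_lines(example))
```

To apply this to a real file, pass the open file object (an iterable of lines) to `chains_from_pdb_lines`.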
If we focus on a certain chain, C, of protein 6cxh, we can find out which chains differ between the two sources and which are the same. In the output dataframe, the diff column shows that chains G, D, E, H, and F are not shared between the RCSB and PDBTM structures (here they appear in the PDBTM chain list but not in the RCSB one), while the same column lists the shared chains A, C, and B. The second return value, transformation_detection, indicates whether the chain of focus has been transformed from another chain in the RCSB PDB structure; untransformed means it has not. Please see the output below.
import tmkit as tmk
df, transformation_detection = tmk.collate.single(
    prot_name='6cxh',
    chain_focus='C',
    pdb_rcsb_fp=pdb_rcsb_fp,
    pdb_pdbtm_fp=pdb_pdbtm_fp,
)
# output
print(df)
  prot_name chain pdbtm_chains rcsb_chains source   diff same
0      6cxh     C     ABCDEFGH         ABC   rcsb  GDEHF  ACB
print(transformation_detection)
{'6cxh.C': 'untransformed'}
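Conceptually, the diff and same columns can be reproduced with set operations on the two chain strings. The following is a sketch of the idea, not tmkit's implementation. It assumes that diff holds the chains present in only one of the two sources and that same holds the chains present in both. The function name `compare_chains` is hypothetical, and the letters are sorted here, so the order may differ from tmkit's output.

```python
def compare_chains(pdbtm_chains, rcsb_chains):
    """Return (diff, same) for two strings of chain IDs.

    Assumption: diff = chains found in only one source,
    same = chains found in both sources.
    """
    pdbtm, rcsb = set(pdbtm_chains), set(rcsb_chains)
    diff = "".join(sorted(pdbtm ^ rcsb))  # symmetric difference
    same = "".join(sorted(pdbtm & rcsb))  # intersection
    return diff, same


# Chain strings copied from the 6cxh output above
diff, same = compare_chains("ABCDEFGH", "ABC")
print(diff, same)  # DEFGH ABC
```

Up to ordering, this matches the GDEHF and ACB values reported for 6cxh.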
If we test chain G of protein 3pux, we find that the PDB structure from PDBTM contains the same chains as that from RCSB, as shown below.
import tmkit as tmk
df, transformation_detection = tmk.collate.single(
    prot_name='3pux',
    chain_focus='G',
    pdb_rcsb_fp=pdb_rcsb_fp,
    pdb_pdbtm_fp=pdb_pdbtm_fp,
)
# output
print(df)
  prot_name chain pdbtm_chains rcsb_chains source diff   same
0      3pux     G        EFGAB       EFGAB   rcsb       AEFGB
print(transformation_detection)
{'3pux.G': 'untransformed'}
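When several chains are checked in turn, the returned dictionaries can be merged and filtered to find entries that need a closer look. The sketch below uses the two results shown above as literal data. `flag_transformed` is a hypothetical helper, and it assumes that any value other than 'untransformed' marks a chain worth inspecting; the other values tmkit may return are not covered in this tutorial.

```python
def flag_transformed(results):
    """Merge per-chain result dicts and return keys not marked 'untransformed'."""
    merged = {}
    for result in results:
        merged.update(result)
    return sorted(key for key, value in merged.items() if value != "untransformed")


# The two dictionaries printed in the examples above
results = [{"6cxh.C": "untransformed"}, {"3pux.G": "untransformed"}]
print(flag_transformed(results))  # []
```

Both example chains are untransformed, so the list is empty.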
Attributes#
Attribute | Description |
---|---|
prot_name | name of a protein, given as the prefix of a PDB file name (e.g., 6cxh in the example above) |
chain_focus | chain of a protein, given in the prefix of a PDB file name (e.g., C in the example above) |
pdb_rcsb_fp | path to a protein complex from RCSB |
pdb_pdbtm_fp | path to a protein complex from PDBTM |
See also
Please see here for a better understanding of the file-naming system.
Output#
======>protein has chains ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
======>basic info:
  prot_name chain pdbtm_chains rcsb_chains source diff   same
0      3pux     G        EFGAB       EFGAB   rcsb       AEFGB
  prot_name chain pdbtm_chains rcsb_chains source diff   same
0      3pux     G        EFGAB       EFGAB   rcsb       AEFGB
{'3pux.G': 'untransformed'}