Foldseek-AlphaFold2

Foldseek-AlphaFold2#

In this tutorial, we introduce how TMKit can use a structural alignment method, Foldseek[1], for seeking functional similar protein structures to an input protein struture by searching a few databases.

Example usage#

First, a good scenario where Foldseek is applied is when we need to understand a predicted structure of a protein. For conveniences, we use 4 AlphaFold2-predicted structures of proteins P63092, Q9B6E8, P07256, and P63027. They are saved in ./data/.

import tmkit as tmk
import pandas as pd

prot_series = pd.Series(['P63092', 'Q9B6E8', 'P07256', 'P63027'])
tmk.seq.retrieve_pdb_alphafold(
    prot_series=prot_series,
    sv_fp='./data/',
)

Output#

===>No.1 protein name: P63092
======>successfully downloaded!
===>No.2 protein name: Q9B6E8
======>successfully downloaded!
===>No.3 protein name: P07256
======>successfully downloaded!
===>No.4 protein name: P63027
======>successfully downloaded!

Then, we can run tmk.seq.retrieve_foldseek to search a few databases by Foldseek. By default, the databases include afdb50, afdb-swissprot, afdb-proteome, cath50, mgnify_esm30, pdb100, and gmgcl_id. We can save the Foldseek results in ./data/ as well.

import tmkit as tmk

tmk.seq.retrieve_foldseek(
    pdb_fp='./data/',
    prot_name='P63027', # https://alphafold.ebi.ac.uk/entry/P63027
    sv_fp='./data/',
)

Output#

===>Searching databases by foldseek...
===>Results have been saved!

Then, we can check the results using the following code. For example, we want to check the result for P63027 and the file is saved as ./data/P63027_foldseek_result.gz. Please find the output in section 3 Output below.

import tarfile
import pandas as pd

with tarfile.open('./data/P63027_foldseek_result.gz', "r") as tar:
    csv_path = tar.getnames()[0]
    for i in tar.getnames():
        print(tar.extractfile(i))
        df = pd.read_csv(tar.extractfile(i), header=None, sep="\t")
        print(df)

Output#

===>Database: alis_afdb-proteome.m8
         0   ...                               20
0   job.pdb  ...                Rattus norvegicus
1   job.pdb  ...                      Danio rerio
2   job.pdb  ...                     Mus musculus
3   job.pdb  ...              Trichuris trichiura
4   job.pdb  ...       Cladophialophora carrionii
5   job.pdb  ...            Madurella mycetomatis
6   job.pdb  ...  Sporothrix schenckii ATCC 58251
7   job.pdb  ...  Schizosaccharomyces pombe 972h-
8   job.pdb  ...    Fonsecaea pedrosoi CBS 271.37
9   job.pdb  ...           Caenorhabditis elegans
10  job.pdb  ...    Histoplasma capsulatum G186AR
11  job.pdb  ...                         Zea mays

[12 rows x 21 columns]
===>Database: alis_afdb-swissprot.m8
         0   ...                               20
0   job.pdb  ...                     Homo sapiens
1   job.pdb  ...                   Macaca mulatta
2   job.pdb  ...                       Bos taurus
3   job.pdb  ...                Rattus norvegicus
4   job.pdb  ...                     Mus musculus
5   job.pdb  ...                   Xenopus laevis
6   job.pdb  ...                     Mus musculus
7   job.pdb  ...                       Bos taurus
8   job.pdb  ...                     Homo sapiens
9   job.pdb  ...                Rattus norvegicus
10  job.pdb  ...                     Mus musculus
11  job.pdb  ...              Macaca fascicularis
12  job.pdb  ...                     Homo sapiens
13  job.pdb  ...           Caenorhabditis elegans
14  job.pdb  ...              Schistosoma mansoni
15  job.pdb  ...                       Bos taurus
16  job.pdb  ...                     Pongo abelii
17  job.pdb  ...   Saccharomyces cerevisiae S288C
18  job.pdb  ...  Schizosaccharomyces pombe 972h-
19  job.pdb  ...           Caenorhabditis elegans
20  job.pdb  ...                     Homo sapiens
21  job.pdb  ...                     Homo sapiens

[22 rows x 21 columns]
===>Database: alis_afdb50.m8
         0   ...                             20
0   job.pdb  ...                    Danio rerio
1   job.pdb  ...                    Danio rerio
2   job.pdb  ...                   Mus musculus
3   job.pdb  ...            Macaca fascicularis
4   job.pdb  ...  Periophthalmus magnuspinnatus
..      ...  ...                            ...
56  job.pdb  ...      Verrucomicrobia bacterium
57  job.pdb  ...           Caulobacter sp. FWC2
58  job.pdb  ...           Fusarium euwallaceae
59  job.pdb  ...                Jatropha curcas
60  job.pdb  ...            Tribolium castaneum

[61 rows x 21 columns]
===>Database: alis_cath50.m8
         0        1   ...       19                              20
0   job.pdb  3hd7A00  ...    10116               Rattus norvegicus
1   job.pdb  3hd7E00  ...    10116               Rattus norvegicus
2   job.pdb  1sfcE00  ...    10116               Rattus norvegicus
3   job.pdb  1sfcI00  ...    10116               Rattus norvegicus
4   job.pdb  1sfcA00  ...    10116               Rattus norvegicus
5   job.pdb  1kilA00  ...     9606                    Homo sapiens
6   job.pdb  2n1tA00  ...     9606                    Homo sapiens
7   job.pdb  5ccgG00  ...    10116               Rattus norvegicus
8   job.pdb  5ccgA00  ...    10116               Rattus norvegicus
9   job.pdb  5kj7A00  ...    10116               Rattus norvegicus
10  job.pdb  5cchA00  ...    10116               Rattus norvegicus
11  job.pdb  6ip1A00  ...     9913                      Bos taurus
12  job.pdb  1l4aA00  ...  1051067             Doryteuthis pealeii
13  job.pdb  2npsA00  ...    10116               Rattus norvegicus
14  job.pdb  4wy4A00  ...     9606                    Homo sapiens
15  job.pdb  3b5nA00  ...   559292  Saccharomyces cerevisiae S288C

[16 rows x 21 columns]
===>Database: alis_gmgcl_id.m8
        0   ...                                                 18
0  job.pdb  ...  YEVTNVSPDEITGDGPGFTDTEWDGDDVTASLPNPSEADDAAGVLD...
1  job.pdb  ...  LTTVSDEWCVSTCAAGCPPAASLWCRCEDVRAADAVPANQGAAAWG...
2  job.pdb  ...  IILSISNKQDTEKIQRESWNIWGTSQWYSTYTIMIKTDVDEYKIVE...

[3 rows x 19 columns]
===>Database: alis_mgnify_esm30.m8
        0   ...                                                 18
0  job.pdb  ...  MSIKYLLIGNPEDCEEIGHYPDRGASKTTAKEADKIFKKLSQSGIQ...
1  job.pdb  ...  MSIQYVLIGNPEDCEEIGHYPDRGASKSIAKEANQIFKKLSESGIK...
2  job.pdb  ...  MTSSSPYEYSAVARNTTILAQFANSNGNFDVLVTEILQKINIPENQ...
3  job.pdb  ...  MAVQYSSIYQGQDLLASKSNGSLPNNVKKLMDSIAIQAKPNDLACV...
4  job.pdb  ...  MASSSATTPACPSLRHVLIVRHDAAIREGTLLCEAWAAAVGTARTS...
5  job.pdb  ...  YLASDKTTGADVAIKEFFPRDYCGRAPDGSLAMSPGHNAGLVDTLK...
6  job.pdb  ...  LMFADGPLTQPPNLAALVRLAGTTAAVDRAIRLTEQQAGNLITTAA...
7  job.pdb  ...  MQEILPARGLARRRSLAASTGIGETTEMDQDSNQTSPGAVAAGPGS...
8  job.pdb  ...  APAASTAPAAPSGGPASACGPEAQGTTPEGRAERELLGALRARRAE...

[9 rows x 19 columns]
===>Database: alis_pdb100.m8
        0   ...                 20
0  job.pdb  ...  Rattus norvegicus
1  job.pdb  ...  Rattus norvegicus

[2 rows x 21 columns]

Results from each database contains 20 columns. Please see this issue comment for their explanations.

[
    'job_desc',
    'query','target','pident','alnlen','mismatch','gapopen','qstart','qend','tstart','tend','evalue','bits',
    'qlen','tlen','qaln','taln','tca','tseq',
    'taxid','taxname',
]