JackHmmer

JackHmmer#

Multiple sequence alignments (MSAs) can support the evolutionary analysis of a protein and have been widely used in protein research. MSAs were usually generated for a sequence by running a few iterations of JackHmmer searches[1] against a protein database, such as UniRef.

In this tutorial, we will go through how we can generate the JackHmmer commands for further MSA generation.

Example usage#

First, let’s prepare identifiers of some transmembrane proteins and make them recognisable in Python, like this.

prots = [
    ['6e3y', 'E'],
    ['6rfq', 'S'],
    ['6t0b', 'm'],
]
import pandas as pd
df = pd.DataFrame(prots, columns=['prot', 'chain'])

We need to specify all parameters used for generating the JackHmmer commands. They mainly include jackhmmer_fp, fasta_fp, db_path, and sv_fp.

Please make sure that you have installed JackHmmer software. If not, please refer to this tutorial for how to install JackHmmer. In our case, an executable of the JackHmmer program is in ./hmmer3.1b2/bin/.

Then, you need to prepare a protein sequence database that can be used for JackHmmer searches. For example, you may use the UniClust database. After then, you can specify db_path as ./uniclust_2020.06/UniRef30_2020_06.

Please specify the path sv_fp that you want to save the MSAs in the a3m format.

fasta_fp = './data/fasta/'
jackhmmer_fp = '/home/hmmer3.1b2/bin/'
db_path ='./uniclust_2020.06/UniRef30_2020_06'
sv_fp ='./data/a3m/'

Let’s then generate the MSAs for these proteins defined above by using the following codes. Of course, there are many more parameters other than the four above.

import tmkit as tmk

for id in df.index:
    prot_name = df.loc[id, 'prot']
    seq_chain = df.loc[id, 'chain']
    tmk.msa.run_jackhmmer(
        jackhmmer_fp=jackhmmer_fp,
        fasta_fpn=fasta_fp + prot_name + seq_chain + '.fasta',
        sv_fpn=sv_fp + prot_name + seq_chain + '.sto',
        db_path=db_path,

        # additional parameters
        cpu=4,
        iteration=3,
        jhm_E=10,
        incE=1e-3,
        noali='',

        # if you won't do it on clusters, please give False to the parameter send2cloud
        send2cloud=False,
        cloud_cmd="",

        # send2cloud=True,
        # cloud_cmd="qsub -q all.q -N 'jsun'",
    )

Attributes#

Attribute	Description
`fasta_fp`	path where a protein Fasta file is placed
`hhblits_fp`	path where an executable of HHblits is placed (normally it is in `hhblits/bin`)
`db_path`	path where a protein sequence database is placed
`sv_fp`	path to where you want to save the MSAs in the a3m format
`cpu`	number of CPUs
`iteration`	number of iterations by a hidden Markov model
`maxfilter`	max number of hits allowed to pass 2nd prefilter (default=20000)
`realign_max`	realign maximum hits displayed hits with the max accuracy algorithm
`all`	do not filter the resulting MSA
`B`	maximum number of alignments in alignment list (default=500)
`Z`	maximum number of lines in summary hit list (default=500)
`e`	maximum E-value in summary and alignment list (default=1E+06)

Output#

Finally, you will see the following output showing the generated commands. It will tell JackHmmer how to search MSAs.

===>The current order: /home/hmmer3.1b2/bin/jackhmmer --cpu 4 -N 3 -E 10 --incE 0.001 --noali -A ./data/a3m/6e3yE.sto ./data/fasta/6e3yE.fasta ./uniclust_2020.06/UniRef30_2020_06
===>The current order: /home/hmmer3.1b2/bin/jackhmmer --cpu 4 -N 3 -E 10 --incE 0.001 --noali -A ./data/a3m/6rfqS.sto ./data/fasta/6rfqS.fasta ./uniclust_2020.06/UniRef30_2020_06
===>The current order: /home/hmmer3.1b2/bin/jackhmmer --cpu 4 -N 3 -E 10 --incE 0.001 --noali -A ./data/a3m/6t0bm.sto ./data/fasta/6t0bm.fasta ./uniclust_2020.06/UniRef30_2020_06