HHfilter#
Filtering multiple sequence alignments (MSAs) is important as it can technically reduce the workload of computation by removing redundant protein sequences. This can be done by the HHfilter method, a tool in HHblits. TMKit provides the interface to HHfilter.
In this tutorial, we will go through how we can generate the HHfilter commands for filtering MSAs, ensuring that no proteins sharing a certain sequence identity are preserved.
Example usage#
First, let’s prepare identifiers of some transmembrane proteins and make them recognisable in Python, like this.
prots = [
['6e3y', 'E'],
['6rfq', 'S'],
['6t0b', 'm'],
]
import pandas as pd
df = pd.DataFrame(prots, columns=['prot', 'chain'])
We need to specify all parameters used for generating the HHblits commands. They mainly include hhfilter_fp
, a3m_path
, and new_a3m_path
.
Please make sure that you have installed HHblits software. If not, please refer to this tutorial for how to install HHblits. In our case, an executable of the HHblits program is in ./hhblits/scripts/
.
Please specify the path new_a3m_path
that you want to save the filtered MSAs in the a3m format.
hhfilter_fp = './hhblits/bin/'
a3m_path = 'data/a3m/'
new_a3m_path = 'data/a3m/filter/'
Let’s then filter the MSAs in the a3m format by using the following codes. Here, we used a maximum pairwise sequence identity of 90% to filter a generated MSA file in a3m format, which ensures that no proteins sharing 90% sequence identity are preserved.
import tmkit as tmk
for id in df.index:
prot_name = df.loc[id, 'prot']
seq_chain = df.loc[id, 'chain']
tmk.msa.run_hhfilter(
hhfilter_fp=hhfilter_fp,
id=90,
a3m_fpn=a3m_path + prot_name + seq_chain + '.a3m',
new_a3m_fpn=new_a3m_path + prot_name + seq_chain + '.a3m',
# if you won't do it on clusters, please give False to the parameter send2cloud
send2cloud=False,
cloud_cmd="",
# send2cloud=True,
# cloud_cmd="qsub -q all.q -N 'jsun'",
)
Attributes#
Attribute |
Description |
---|---|
|
path where an executable of HHfilter is placed (normally it is in hhblits/bin) |
|
path where a protein a3m file is placed |
|
path to where you want to save a filtered MSA in the a3m format |
|
maximum pairwise sequence identity (def=90) |
See also
We encourage users to check the meaning of all parameters via this route. Please see here for better understanding the file-naming system.
Output#
Finally, you will see the following output showing the generated commands. It will tell HHfilter how to filter generated a3m files.
===>The current order: ./hhblits/bin/hhfilter -i data/a3m/6e3yE.a3m -o data/a3m/filter/6e3yE.a3m -id 90
===>The current order: ./hhblits/bin/hhfilter -i data/a3m/6rfqS.a3m -o data/a3m/filter/6rfqS.a3m -id 90
===>The current order: ./hhblits/bin/hhfilter -i data/a3m/6t0bm.a3m -o data/a3m/filter/6t0bm.a3m -id 90