Input¶
External tools¶
Protein sequences are the only input required to run DeepHelicon. They are also used to generate a set of intermediate files before DeepHelicon is run, using the external tools listed in Table 1.
Fasta
Protein sequences in FASTA format are required. The file extension must be .fasta for the software to recognise the file.
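As a minimal sketch of the extension rule above, a check like the following could be run before a file is passed to DeepHelicon. The helper name `read_first_fasta_record` is hypothetical and not part of the DeepHelicon API, and the `>` header test reflects the general FASTA convention rather than any documented DeepHelicon behaviour.

```python
from pathlib import Path


def read_first_fasta_record(path):
    """Return (header, sequence) for the first record of a FASTA file.

    Hypothetical helper for illustration; not part of DeepHelicon.
    """
    path = Path(path)
    # The documentation states that the extension must be .fasta.
    if path.suffix != ".fasta":
        raise ValueError(f"expected a .fasta extension, got {path.suffix!r}")

    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")

    header = lines[0][1:]
    residues = []
    for ln in lines[1:]:
        if ln.startswith(">"):  # a second record begins here; stop reading
            break
        residues.append(ln)
    return header, "".join(residues)
```

Calling the helper on a file with another extension, such as `protein.fa`, raises a `ValueError`, so the problem surfaces before any external tool is run.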
Table 1: External tools for generating intermediate files before running DeepHelicon.
Tool | Role | Function | Source |
---|---|---|---|
HHblits | Input | generating multiple sequence alignments | Remmert et al. (2011) |
CCMpred | Input | predictor of residue contacts | Seemayer et al. (2014) |
plmDCA | Input | predictor of residue contacts | Ekeberg et al. (2013) |
Gaussian DCA | Input | predictor of residue contacts | Baldassi et al. (2014) |
FreeContact | Input | predictor of residue contacts | Kaján et al. (2014) |
TMHMM | Input | predictor of transmembrane topologies | Krogh et al. (2001) |
Uniclust30 database | Intermediate | sequence database | Mirdita et al. (2016) |
Output files¶
DeepHelicon can return an output file with the suffix .s1 (stage 1), .s2i1, .s2i2, .s2i3 or .s2i4 (stage 2), or .deephelicon, depending on which model is used at each stage.
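The suffix-to-stage relationship above can be sketched as a small lookup. The function name `stage_of`, the example file names, and the label "final" for the .deephelicon suffix are assumptions made for illustration; the source only states that the stage 1 and stage 2 files hold intermediate results.

```python
from pathlib import Path

# Output suffixes listed in this section, mapped to the stage that writes them.
# Labelling .deephelicon as "final" is an assumption, not a documented name.
SUFFIX_TO_STAGE = {
    ".s1": "stage 1",
    ".s2i1": "stage 2",
    ".s2i2": "stage 2",
    ".s2i3": "stage 2",
    ".s2i4": "stage 2",
    ".deephelicon": "final",
}


def stage_of(filename):
    """Return the stage label for an output file, or None if the suffix is unknown."""
    return SUFFIX_TO_STAGE.get(Path(filename).suffix)


# Hypothetical file names, used only to show the lookup.
for name in ["example.s1", "example.s2i3", "example.deephelicon", "example.fasta"]:
    print(name, "->", stage_of(name))
```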
This predictor returns predictions of inter-helical residue contacts in transmembrane proteins. If only non-transmembrane segments or <1 transmembrane segment is detected, the program will not return final results. However, you can still use the intermediate results from stages 1 and 2, as described in Sun & Frishman (2020). To handle cases in which the built-in transmembrane topology predictor detects <1 helix, we will consider extending the module in future work so that it generates a file containing the complete results. DeepHelicon outputs results in two formats.
DeepHelicon format¶
Prediction results for interaction sites in transmembrane proteins consist of 5 columns.
Table 2: DeepHelicon output format.
Position 1 | Residue 1 | Position 2 | Residue 2 | Probability |
---|---|---|---|---|
1 | S | 6 | L | 0.14790976 |
1 | S | 7 | R | 0.041100707 |
1 | S | 8 | W | 0.04841847 |
... | ... | ... | ... | ... |
170 | F | 176 | K | 0.05994133 |
171 | A | 176 | K | 0.07471807 |
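A minimal sketch of reading the five columns shown in Table 2 follows. It assumes that the file on disk is whitespace-delimited and has no header row; neither detail is stated here, so adjust the splitting if the actual files differ. `Contact` and `parse_contacts` are hypothetical names, not part of the DeepHelicon API.

```python
from typing import NamedTuple


class Contact(NamedTuple):
    pos1: int    # position of the first residue
    res1: str    # one-letter code of the first residue
    pos2: int    # position of the second residue
    res2: str    # one-letter code of the second residue
    prob: float  # predicted contact probability


def parse_contacts(text):
    """Parse 5-column rows (assumed whitespace-delimited, no header)."""
    contacts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        pos1, res1, pos2, res2, prob = fields
        contacts.append(Contact(int(pos1), res1, int(pos2), res2, float(prob)))
    return contacts


# The first three rows of Table 2, written in the assumed on-disk layout.
sample = """\
1 S 6 L 0.14790976
1 S 7 R 0.041100707
1 S 8 W 0.04841847
"""

contacts = parse_contacts(sample)
# Pick the residue pair with the highest predicted probability.
top = max(contacts, key=lambda c: c.prob)
print(top.pos1, top.res1, top.pos2, top.res2, top.prob)
```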
CASP14 format¶
CASP14 output has 3 columns: positions of residue pairs and their contact probabilities.
Table 3: CASP14 output format.
Position 1 | Position 2 | Probability |
---|---|---|
1 | 6 | .148 |
1 | 7 | .041 |
1 | 8 | .048 |
... | ... | ... |
170 | 176 | .060 |
171 | 176 | .075 |
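The probabilities in Table 3 are the Table 2 values rounded to three decimals with the leading zero dropped. The sketch below reproduces that relationship; the function names are hypothetical, and the single-space column separator is an assumption rather than a documented property of the output file.

```python
def casp_probability(prob):
    """Format a probability with three decimals and no leading zero."""
    text = f"{prob:.3f}"
    # "0.148" becomes ".148"; a value that rounds to "1.000" is left unchanged.
    return text[1:] if text.startswith("0.") else text


def to_casp_row(pos1, pos2, prob):
    """Build one 3-column row (separator assumed to be a single space)."""
    return f"{pos1} {pos2} {casp_probability(prob)}"


# Rows taken from Table 2: (position 1, position 2, probability).
rows = [
    (1, 6, 0.14790976),
    (1, 7, 0.041100707),
    (1, 8, 0.04841847),
    (170, 176, 0.05994133),
    (171, 176, 0.07471807),
]
for pos1, pos2, prob in rows:
    print(to_casp_row(pos1, pos2, prob))
```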
Example data¶
Users can download some example data and check an assortment of input files.
import deephelicon

# Download the example archive from the GitHub release and save it locally.
deephelicon.predict.download_data(
    url='https://github.com/2003100127/deephelicon/releases/download/example_data/example_data.zip',
    sv_fpn='../../data/deephelicon/example_data.zip',
)
 ____                  _   _      _ _
|  _ \  ___  ___ _ __ | | | | ___| (_) ___ ___  _ __
| | | |/ _ \/ _ \ '_ \| |_| |/ _ \ | |/ __/ _ \| '_ \
| |_| |  __/  __/ |_) |  _  |  __/ | | (_| (_) | | | |
|____/ \___|\___| .__/|_| |_|\___|_|_|\___\___/|_| |_|
                |_|
06/04/2025 04:45:18 logger: =>Downloading starts...
06/04/2025 04:45:20 logger: =>downloaded.
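Once the archive has been downloaded, its contents can be inspected with the Python standard library. The sketch below is not part of DeepHelicon: `extract_and_group` is a hypothetical helper, and the layout of files inside the archive is not described here, so the function only groups whatever it finds by file extension.

```python
import zipfile
from pathlib import Path


def extract_and_group(zip_fpn, out_dir):
    """Extract a zip archive and group the extracted file names by suffix."""
    with zipfile.ZipFile(zip_fpn) as zf:
        zf.extractall(out_dir)

    by_suffix = {}
    for fp in sorted(Path(out_dir).rglob("*")):
        if fp.is_file():
            by_suffix.setdefault(fp.suffix, []).append(fp.name)
    return by_suffix
```

With the path used above, the call would be `extract_and_group('../../data/deephelicon/example_data.zip', '../../data/deephelicon/example_data')`, where the output directory name is an assumption. The function lists every file under the output directory, so choose a directory that does not already hold unrelated files.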
- Remmert, M., Biegert, A., Hauser, A., & Söding, J. (2011). HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods, 9(2), 173–175. 10.1038/nmeth.1818
- Seemayer, S., Gruber, M., & Söding, J. (2014). CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations. Bioinformatics, 30(21), 3128–3130. 10.1093/bioinformatics/btu500
- Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Physical Review E, 87(1). 10.1103/physreve.87.012707
- Baldassi, C., Zamparo, M., Feinauer, C., Procaccini, A., Zecchina, R., Weigt, M., & Pagnani, A. (2014). Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners. PLoS ONE, 9(3), e92721. 10.1371/journal.pone.0092721
- Kaján, L., Hopf, T. A., Kalaš, M., Marks, D. S., & Rost, B. (2014). FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinformatics, 15(1). 10.1186/1471-2105-15-85
- Krogh, A., Larsson, B., von Heijne, G., & Sonnhammer, E. L. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305(3), 567–580. 10.1006/jmbi.2000.4315
- Mirdita, M., von den Driesch, L., Galiez, C., Martin, M. J., Söding, J., & Steinegger, M. (2016). Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research, 45(D1), D170–D176. 10.1093/nar/gkw1081
- Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., & Sander, C. (2011). Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE, 6(12), e28766. 10.1371/journal.pone.0028766
- Sun, J., & Frishman, D. (2020). DeepHelicon: Accurate prediction of inter-helical residue contacts in transmembrane proteins by residual neural networks. Journal of Structural Biology, 212(1), 107574. https://doi.org/10.1016/j.jsb.2020.107574