Conservation
Entropy
In the context of multiple sequence alignments (MSAs), entropy measures the variability or uncertainty at each position in the alignment. It quantifies how conserved or variable a position is across the aligned sequences: low entropy indicates a conserved position, which is more likely to be functionally or structurally important.
We can first read an MSA in ALN format with pp.msa.msaparser. Then, passing this MSA to pp.fpmsa.entropy, we obtain a dictionary of entropy values keyed by alignment position.
Output
{1: 0.1960388659924533, 2: 0.22426701608683547, 3: 0.1850087818623904, 4: 0.20286277711377826, 5: 0.18773405896608028, 6: 0.18177904812214024, 7: 0.22124858188044072, 8: 0.1964272886606503, 9: 0.18956787701644068, 10: 0.19008797269348587, ..., 279: 0.07534724091084649, 280: 0.07102837084165835, 281: 0.06073066166727346}
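To make the quantity concrete, here is a minimal sketch of per-column Shannon entropy in plain Python. It is not the implementation behind pp.fpmsa.entropy: the function names, the toy alignment, and the choice of log base 2 are assumptions made for illustration, so the numbers will not match the output above.

```python
import math
from collections import Counter


def column_entropy(column, ignore_gaps=True, base=2):
    """Shannon entropy of one alignment column (hypothetical helper)."""
    symbols = [c for c in column if not (ignore_gaps and c == "-")]
    if not symbols:
        return 0.0
    total = len(symbols)
    counts = Counter(symbols)
    # H = sum over symbols of -p * log(p)
    return sum(-(n / total) * math.log(n / total, base) for n in counts.values())


def msa_entropy(sequences, ignore_gaps=True, base=2):
    """Map each 1-based alignment position to its column entropy."""
    # zip(*sequences) transposes the aligned rows into columns
    return {
        pos: column_entropy(col, ignore_gaps, base)
        for pos, col in enumerate(zip(*sequences), start=1)
    }


# Toy alignment (illustrative data, not taken from the example MSA)
toy_msa = ["MKV-", "MRVA", "MKIA", "MKVA"]
scores = msa_entropy(toy_msa)
```

Fully conserved columns score zero, and the score grows as the residues in a column become more mixed.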
To also take the gap symbol '-' into account, we can use pp.fpmsa.entropy_gap.
Output
{1: 0.3256135083150946, 2: 0.3724994507477074, 3: 0.30729293513478106, ..., 279: 0.12514905822699368, 280: 0.11797557031649158, 281: 0.10087144560680732}
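One way to include gaps is to treat '-' as an additional symbol when counting. The sketch below does that and, as an assumption of this example only, normalises by using the alphabet size (20 amino acids plus the gap) as the log base so that scores fall in [0, 1]. The names are hypothetical and this is not necessarily how pp.fpmsa.entropy_gap computes its values.

```python
import math
from collections import Counter

# 20 standard amino acids plus the gap symbol (assumed alphabet)
ALPHABET_WITH_GAP = "ACDEFGHIKLMNPQRSTVWY-"


def column_entropy_with_gaps(column):
    """Entropy of a column where '-' counts as a symbol of its own."""
    total = len(column)
    counts = Counter(column)
    base = len(ALPHABET_WITH_GAP)  # log base 21 keeps the score within [0, 1]
    return sum(-(n / total) * math.log(n / total, base) for n in counts.values())


# Toy alignment (illustrative data)
toy_msa = ["MKV-", "MRVA", "MKIA", "MKVA"]
gap_scores = {
    pos: column_entropy_with_gaps(col)
    for pos, col in enumerate(zip(*toy_msa), start=1)
}
```

In this toy alignment the last column contains one gap and three identical residues, so it scores above zero here, whereas a gap-ignoring entropy would treat it as fully conserved.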
Custom-conservation
We borrow an idea from Zeng et al. [1] to build a custom conservation score from the entropy values calculated above.
Output
{1: 0.891307540667243, 2: 0.8756566285859075, 3: 0.8974230982362708, ..., 279: 0.9582242180545316, 280: 0.9606187871466461, 281: 0.9663282842404544}
Mutual information
Mutual information (MI) is an information-theoretic metric that quantifies how much information one random variable reveals about another. In the context of multiple sequence alignments (MSAs), MI measures the statistical dependency between sequence positions, helping to identify co-evolving residues.
For example, calculating the MI between columns 1 and 2 of the alignment gives:
Output
0.20924659742293167
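For intuition, MI between two columns can be computed from the joint and marginal residue frequencies. The following is a minimal sketch, not the library's implementation: the function name is hypothetical, and skipping any sequence with a gap in either column and using log base 2 are assumptions of this example.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j, base=2):
    """MI between two alignment columns given as equal-length strings."""
    # Assumption: ignore sequences that have a gap in either column
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    marg_i = Counter(a for a, _ in pairs)
    marg_j = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marg_i[a] / n
        p_b = marg_j[b] / n
        # Each observed pair contributes p(a,b) * log(p(a,b) / (p(a) * p(b)))
        mi += p_ab * math.log(p_ab / (p_a * p_b), base)
    return mi


# Toy columns: the first pair co-varies perfectly, the second is independent
covarying = mutual_information("AACC", "DDEE")
independent = mutual_information("AACC", "DEDE")
```

Columns whose residues change together give a high MI, while columns that vary independently give an MI of zero.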
Jensen–Shannon Divergence
Jensen–Shannon divergence (JSD) [2] is a useful metric in structural bioinformatics for evaluating the conservation of sequence positions in multiple sequence alignments (MSAs). By analyzing the distribution of amino acids or nucleotides at each position, JSD helps to identify conserved and variable regions, indicating areas of functional or structural significance.
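As a rough illustration, the divergence for a single column can be computed between the column's residue distribution and a background distribution. The sketch below is a simplification under stated assumptions: it uses a uniform background over the 20 amino acids, ignores gaps, and applies no sequence weighting, window smoothing, or gap penalty, so it will not reproduce scores from the standalone tool. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; zero-probability terms of p are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(column, background=None):
    """JSD between a column's residue distribution and a background distribution."""
    residues = [c for c in column if c in AMINO_ACIDS]  # drops gaps and unknowns
    if not residues:
        return 0.0
    counts = Counter(residues)
    p = [counts[a] / len(residues) for a in AMINO_ACIDS]
    # Assumption: uniform background unless one is supplied
    q = background or [1 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


conserved = js_divergence("MMMM")
variable = js_divergence("ACDE")
```

With log base 2 the score lies between 0 and 1. A column dominated by one residue diverges strongly from the background and scores high, while a column that resembles the background scores close to zero.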
A JSD file looks like this:
# ./CLEC2B_LOC113845378.clustal -- js_divergence - window_size: 3 - window lambda: 0.50 - background: blosum62 - seq. weighting: True - gap penalty: 1 - normalized: False
# align_column_number score column
0 -1000.000000 M-M-M-T----TET--TTTMM-M-----------------------------------------------------P-----------------------------------------------
1 -1000.000000 E-S-D-Q----QTE--QQQEE-E-----------------------------------------------------V----------------------G------------------------
2 -1000.000000 K-S-S-N----DGY--DNNPP-P-----------------------------------------------------P----------------------P------------------------
3 -1000.000000
...
205 -1000.000000 PN---S---P--VPF------R---------------L---LL---LLL------L----DD---V----A--E----------------------------------EME---L-EE-I----
206 -1000.000000 EP---P---R---VP----------------------S---SS---SSS------S----HH---T----E--K----------------------------------SCS---C-SS-M----
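The layout above (comment lines starting with '#', then an alignment column number, a score, and the column's residues) is simple enough to parse by hand. The following is a minimal sketch assuming whitespace-separated fields. The function name and the shortened sample text are made up for illustration, and the scores are kept exactly as they appear in the file.

```python
def parse_jsd_text(text):
    """Parse JSD output text into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split(maxsplit=2)
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip anything that is not a data row
        rows.append(
            {
                "alignment_col": int(fields[0]),
                "score": float(fields[1]),
                # The residue column may be absent on some rows
                "seq": fields[2] if len(fields) == 3 else "",
            }
        )
    return rows


# Shortened sample in the same layout (illustrative, not a real file)
sample = """# align_column_number score column
0 -1000.000000 M-M-M-T----TET
1 -1000.000000 E-S-D-Q----QTE
3 -1000.000000
"""
rows = parse_jsd_text(sample)
```

Each row becomes a dictionary with the keys alignment_col, score, and seq, which can then be loaded into a table.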
The standalone mode refers to the output produced by running the standalone JSD package.
Output
alignment_col score seq
0 0 0.0 M-M-M-T----TET--TTTMM-M-----------------------...
1 1 0.0 E-S-D-Q----QTE--QQQEE-E-----------------------...
2 2 0.0 K-S-S-N----DGY--DNNPP-P-----------------------...
3 3 0.0 E-E-E-E----EEN--EEEAA-A-----------------------...
4 4 0.0 V-N-N-G----EKN--EGG---------------------------...
.. ... ... ...
202 202 0.0 KL---T---G--TTT------E-P---------D--DEGDDEES--...
203 203 0.0 II---P---S--PPP------Y-L---------S--SQ-QSRRK--...
204 204 0.0 SV---F---V--PVF------M-L-------------F-S-LL---...
205 205 0.0 PN---S---P--VPF------R---------------L---LL---...
206 206 0.0 EP---P---R---VP----------------------S---SS---...
[207 rows x 3 columns]
ConSurf
ConSurf [3] is designed to identify and visualize conserved regions within protein or nucleic acid sequences. By mapping conservation levels onto the three-dimensional structure of a protein, ConSurf aids in pinpointing functionally and structurally significant regions.
We extract the relevant information from a ConSurf output file into a table.
The v1 mode refers to the output format used by ConSurf before 2024.
Output
position amino acid score color exposed/buried structral/functional
0 1 M -1.409 9 e f
1 2 Y 0.136 5* e
2 3 S -0.426 6 e
3 4 F -0.115 5 b
4 5 V -1.201 8 b
.. ... ... ... ... ... ...
70 71 P 0.637 3* e
71 72 D 0.109 5* e
72 73 L -0.200 6* b
73 74 L -0.120 5* e
74 75 V -0.991 8 e f
[75 rows x 6 columns]
The original ConSurf output for the E protein is:
0 1 2 3 4 5 6 ... 8 9 10 11 12 13 14
0 1 M -1.409 9 -1.875,-1.232 ... 9,8 e f 78/150 M,N
1 2 Y 0.136 5* -0.618, 0.577 ... 7,3 e 78/150 C,F,Y,L
2 3 S -0.426 6 -1.003,-0.074 ... 8,5 e 78/150 E,P,S,Y,D
3 4 F -0.115 5 -0.776, 0.279 ... 7,4 b 78/150 I,F,L
4 5 V -1.201 8 -1.582,-0.928 ... 9,7 b 78/150 V,F,Q
.. .. .. ... .. .. ... .. ... ... .. .. .. .. ... ...
70 71 P 0.637 3* -0.365, 1.225 ... 6,2 e 34/150 S,P,L
71 72 D 0.109 5* -0.776, 0.755 ... 7,3 e 34/150 D,E
72 73 L -0.200 6* -1.003, 0.419 ... 8,4 b 34/150 L,F
73 74 L -0.120 5* -1.003, 0.419 ... 8,4 e 34/150 I,L
74 75 V -0.991 8 -1.582,-0.618 ... 9,7 e f 33/150 V
[75 rows x 15 columns]
From left to right, the columns are defined as:
Column definition
- POS: The position of the AA in the SEQRES derived sequence.
- SEQ: The SEQRES derived sequence in one letter code.
- SCORE: The normalized conservation scores.
- COLOR: The color scale representing the conservation scores (9 - conserved, 1 - variable).
- CONFIDENCE INTERVAL: When using the Bayesian method for calculating rates, a confidence interval is assigned to each of the inferred evolutionary conservation scores.
- CONFIDENCE INTERVAL COLORS: When using the Bayesian method for calculating rates, the color scale representing the lower and upper bounds of the confidence interval.
- B/E: Buried (b) or Exposed (e) residue.
- FUNCTION: functional (f) or structural (s) residue (f - highly conserved and exposed, s - highly conserved and buried).
- MSA DATA: The number of aligned sequences having an amino acid (non-gapped) from the overall number of sequences at each position.
- RESIDUE VARIETY: The variety of residues at each position of the multiple sequence alignment.
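Some of these fields pack more than one value into a single string, for example a COLOR of 5* or an MSA DATA value of 78/150 in the output above. The helpers below are a small sketch of how such fields could be split for downstream analysis. Their names are hypothetical and they are not part of ConSurf or of the library. The meaning of the asterisk is not defined in the column list above, so the sketch only records whether it is present.

```python
def parse_color(color):
    """Split a COLOR field such as '5*' into (grade, has_asterisk)."""
    has_asterisk = color.endswith("*")
    return int(color.rstrip("*")), has_asterisk


def parse_msa_data(msa_data):
    """Split an MSA DATA field such as '78/150' into (non_gapped, total, fraction)."""
    non_gapped, total = (int(part) for part in msa_data.split("/"))
    return non_gapped, total, non_gapped / total


def parse_residue_variety(variety):
    """Split a RESIDUE VARIETY field such as 'C,F,Y,L' into a list of residues."""
    return [residue.strip() for residue in variety.split(",") if residue.strip()]


# Values copied from the second row of the ConSurf output shown above
grade, flagged = parse_color("5*")
non_gapped, total, fraction = parse_msa_data("78/150")
residues = parse_residue_variety("C,F,Y,L")
```

Splitting the fields this way makes it straightforward to filter positions, for instance by grade or by the fraction of sequences that are not gapped at a position.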
1. Zeng B, Hönigschmid P, Frishman D. Residue co-evolution helps predict interaction sites in α-helical membrane proteins. J Struct Biol. 2019 May 1;206(2):156-169. Epub 2019 Mar 2. PMID: 30836197. https://doi.org/10.1016/j.jsb.2019.02.009
2. John A. Capra, Mona Singh. Predicting functionally important residues from sequence conservation. Bioinformatics, Volume 23, Issue 15, August 2007, Pages 1875–1882. https://doi.org/10.1093/bioinformatics/btm270
3. Barak Yariv, Elon Yariv, Amit Kessel, Gal Masrati, Adi Ben Chorin, Eric Martz, Itay Mayrose, Tal Pupko, and Nir Ben-Tal. Using evolutionary data to make sense of macromolecules with a 'face-lifted' ConSurf. Protein Science 2023. PMID: 36718848. https://doi.org/10.1002/pro.4582