Modelling of β-turns using Hidden Markov Model
Nivedita Rao
Ms. Sunila Godara
Abstract- One of the major tasks in predicting the secondary structure of a protein is to find the β-turns. Turns play an important role in globular proteins, so identifying them helps in understanding the functional and structural traits of these proteins. β-turns also play an important part in protein folding. They constitute on average 25% of the residues in all protein chains and are the most common form of non-repetitive structure. It is already known that helices and β-sheets are among the most important elements in stabilizing protein structures. In this paper we use a hidden Markov model (HMM) to predict the β-turns in proteins based on amino acid composition and compare it with other existing methods.
Keywords- β-turns, amino acid composition, hidden Markov model, residue.
I. Introduction
Bioinformatics has become a vital part of many areas of biology. In molecular biology, bioinformatics techniques such as signal processing and image processing allow useful results to be extracted from large volumes of raw data. In the fields of genetics and genomics, it helps in sequencing and annotating genomes and their observed mutations. It plays an important part in the analysis of protein expression, gene expression and their regulation. It also helps in the text mining of biological literature and in the development of biological and gene ontologies for organizing and querying biological data. Bioinformatics tools aid in the comparison of genetic and genomic data and, more generally, in the understanding of evolutionary aspects of molecular biology. At a more integrative level, bioinformatics helps in analyzing and cataloguing the biological pathways and networks that are an important part of systems biology. In structural biology, bioinformatics helps in the understanding, simulation and modelling of RNA, DNA and protein structures as well as molecular interactions.
The pace of genome sequencing has increased radically over recent years, resulting in explosive growth of biological data and widening the gap between the number of protein sequences stored in databases and the experimental annotation of their functions.
There are many types of tight turns, which are distinguished by the number of atoms that form the turn [1]. Among them is the β-turn, one of the important components of protein structure, as it plays an important part in molecular structure and protein folding. A β-turn involves four consecutive residues where the polypeptide chain bends back on itself by about 180 degrees [2].
These chain reversals give a protein its globular rather than linear shape. β-turns can be further classified into different types. According to Venkatachalam [3], β-turns can be of 10 types based on phi and psi angles, among other criteria. Richardson [4] suggested only 6 distinct types (I, I', II, II', VIa and VIb) on the basis of phi and psi ranges, along with a new category IV. At present, Richardson's classification is the most widely used.
Turns are an important part of globular proteins from both a structural and a functional point of view. Without turns, a polypeptide chain cannot fold into a compact structure. Turns also normally occur on the exposed surface of proteins and are therefore likely to represent antigenic sites or to be involved in molecular recognition. For these reasons, the prediction of β-turns in proteins is an important element of secondary structure prediction.
II. Related work
A lot of work has been done on the prediction of β-turns. To determine the chain reversal regions of a globular protein, Chou et al. [5] used conformational parameters. Chou et al. [6] proposed a residue-coupled model to predict the β-turns in proteins. Chou et al. [7] used the tetrapeptide sequence. Chou [8] later predicted tight turns and their types in proteins using amino acid residues. Guruprasad et al. [9] predicted β-turns and γ-turns in proteins using a new set of amino acid positional preferences and hydrogen bonds. Hutchinson et al. [10] created a program called PROMOTIF to identify and analyse structural motifs in proteins. Shepherd et al. [11] used neural networks to predict the location and type of β-turns. Wilmot et al. [12] analysed and predicted different types of β-turn in proteins using phi and psi angles and central residues. Wilmot et al. [13] proposed a new nomenclature for β-turns and their distortions, along with GORBTURN 1.0 for predicting them.
This study uses a hidden Markov model to predict the β-turns in a protein. HMMs have been widely used as tools for biological sequence analysis.
Figure 1.1 (a) Type-I β-turn and (b) Type-II β-turn. The hydrogen bond is denoted by dashed lines. [14]
III. Materials and methods
A. Dataset
The dataset used in the experiment is a non-redundant dataset previously described by Guruprasad and Rajkumar [9]. It contains 426 non-homologous protein chains, no two of which share more than 25% sequence similarity; this ensures that there is very little correlation within the training set. Each protein chain contains at least one β-turn and has a structure determined by X-ray crystallography at a resolution of 2.0 Å or better.
The dataset comprises ten main classes; the remaining classes are combinations of these ten.
Table 1 Dataset description [14]

| Class | No. of proteins |
|---|---|
| α proteins (class a) | 68 |
| β proteins (class b) | 97 |
| α/β proteins (class c) | 102 |
| α+β proteins (class d) | 86 |
| Multiple domain proteins (class e) | 9 |
| Small proteins (class f) | 2 |
| Coiled proteins (class g) | 22 |
| Low resolution proteins (class h) | 0 |
| Peptides (class i) | 0 |
| Designed proteins (class j) | 1 |
| Both a and b classes | 3 |
| Both a and c classes | 7 |
| Both a and d classes | 5 |
| Both b and c classes | 6 |
| Both b and d classes | 4 |
| Both b and f classes | 1 |
| Both c and d classes | 10 |
| Both c and g classes | 1 |
| Both b, c and d classes | 2 |
B. Hidden Markov model
In our work, we use the probabilistic nature of the HMM for β-turn prediction. The model assumes that the protein sequence is generated by a stochastic process that alternates between two hidden states: "turn" and "non-turn". The HMM is trained using 20 protein sequences.
The transition probability matrix is 2×2 for the two states, turns and non-turns. The emission probability matrix is 2×20, as there are 2 states and 20 amino acids. We prepared both matrices using what we know about the dataset, namely that non-β-turn residues are more probable than β-turn residues in a protein sequence, and using the per-residue probabilities taken from Chou [7] as the parameters for calculating the emission and transition matrices.
Although the dataset contains more than ten classes, the HMM parameters are estimated for 2 super-states, and training was performed on these.
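The shape of the two matrices described above can be sketched in a few lines. This is only an illustration: the numeric values, the names `transition`, `emission` and `stationary_turn_fraction`, and the choice of residues given extra weight in the turn state are all assumptions, not the parameters used in this work or the values from Chou [7]. The transition values are chosen so that the long-run turn fraction is 0.25, echoing the roughly 25% of residues in β-turns quoted in the abstract.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues, one-letter codes
STATES = ("turn", "non-turn")

def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

# Transition matrix (2 x 2): row = current state, column = next state.
# Hypothetical values, chosen so that non-turns dominate.
transition = [
    [0.70, 0.30],   # turn     -> turn, non-turn
    [0.10, 0.90],   # non-turn -> turn, non-turn
]

# Emission matrix (2 x 20): row = state, column = residue.
# Placeholder weights, NOT the published propensities: a few residues
# (G, N, P, D, S) are simply given extra weight in the turn state.
turn_weights = [1.5 if aa in "GNPDS" else 1.0 for aa in AMINO_ACIDS]
non_turn_weights = [1.0] * len(AMINO_ACIDS)
emission = [normalise(turn_weights), normalise(non_turn_weights)]

def stationary_turn_fraction(trans):
    """Long-run fraction of residues in the turn state of a 2-state chain."""
    leave_turn, enter_turn = trans[0][1], trans[1][0]
    return enter_turn / (leave_turn + enter_turn)

turn_fraction = stationary_turn_fraction(transition)
```

Each row of both matrices sums to 1, which is the only structural constraint; any real parameter set would replace the placeholder weights.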
Let P be a protein sequence of length n, which can also be expressed as
$$P = r_1 r_2 \dots r_n$$
where r_i is the amino acid residue at sequence position i. In the hidden Markov model, the sequence is considered to be generated from r_1 to r_n. The model is trained using the Baum-Welch algorithm [15].
The Baum-Welch algorithm is a standard method for finding maximum likelihood estimates of HMM parameters, in which posterior probabilities are computed using both the forward and backward algorithms. These algorithms were used to estimate the state transition probability and emission probability matrices.
The initial probabilities are calculated taking into account the correlation between residues in different positions. The most probable path is calculated using the Viterbi algorithm [16], as it automatically segments the protein into its component regions.
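As an illustration of the re-estimation step, the following is a minimal sketch of one Baum-Welch update for a single sequence, built from scaled forward and backward passes. It is not the implementation used in this work: the function names, the example sequence and the starting parameters are all hypothetical, and state 0 is arbitrarily taken to mean "turn".

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def forward(obs, start, trans, emit):
    """Scaled forward pass: returns normalised alpha rows and scale factors."""
    k = len(start)
    alpha, scales = [], []
    for t, o in enumerate(obs):
        if t == 0:
            row = [start[i] * emit[i][o] for i in range(k)]
        else:
            prev = alpha[-1]
            row = [emit[j][o] * sum(prev[i] * trans[i][j] for i in range(k))
                   for j in range(k)]
        c = sum(row)
        alpha.append([x / c for x in row])
        scales.append(c)
    return alpha, scales

def backward(obs, trans, emit, scales):
    """Backward pass using the same scale factors as the forward pass."""
    k, T = len(trans), len(obs)
    beta = [[1.0] * k for _ in range(T)]
    for t in range(T - 2, -1, -1):
        o_next = obs[t + 1]
        for i in range(k):
            beta[t][i] = sum(trans[i][j] * emit[j][o_next] * beta[t + 1][j]
                             for j in range(k)) / scales[t + 1]
    return beta

def baum_welch_step(obs, start, trans, emit):
    """One EM update; also returns the log-likelihood of obs under the
    parameters that were passed in."""
    k, n_sym, T = len(start), len(emit[0]), len(obs)
    alpha, scales = forward(obs, start, trans, emit)
    beta = backward(obs, trans, emit, scales)
    # gamma[t][i]: posterior probability of being in state i at position t
    gamma = [[alpha[t][i] * beta[t][i] for i in range(k)] for t in range(T)]
    new_start = gamma[0][:]
    new_trans = [[0.0] * k for _ in range(k)]
    for i in range(k):
        visits = sum(gamma[t][i] for t in range(T - 1))
        for j in range(k):
            # expected number of i -> j transitions along the sequence
            moves = sum(alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]]
                        * beta[t + 1][j] / scales[t + 1] for t in range(T - 1))
            new_trans[i][j] = moves / visits
    new_emit = [[0.0] * n_sym for _ in range(k)]
    for i in range(k):
        visits = sum(gamma[t][i] for t in range(T))
        for t in range(T):
            new_emit[i][obs[t]] += gamma[t][i] / visits
    log_likelihood = sum(math.log(c) for c in scales)
    return new_start, new_trans, new_emit, log_likelihood

# Hypothetical sequence and starting parameters, for illustration only.
sequence = "MKTAYGNPDGSLLVAGNGK"
obs = [AMINO_ACIDS.index(aa) for aa in sequence]
start = [0.25, 0.75]                       # state 0 = turn, state 1 = non-turn
trans = [[0.7, 0.3], [0.1, 0.9]]
turn_w = [1.5 if aa in "GNPDS" else 1.0 for aa in AMINO_ACIDS]
emit = [[w / sum(turn_w) for w in turn_w], [1.0 / 20] * 20]

history = []
for _ in range(3):
    start, trans, emit, ll = baum_welch_step(obs, start, trans, emit)
    history.append(ll)
```

The log-likelihood recorded in `history` should never decrease from one iteration to the next, which is a convenient sanity check. Training on several sequences would accumulate the expected counts over all of them before normalising.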
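The decoding step can be sketched as follows. This is a generic log-space Viterbi routine under assumed parameters, not the code used in this work; the function name `viterbi`, the example sequence and all numeric values are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def safe_log(x):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(x) if x > 0 else float("-inf")

def viterbi(obs, start, trans, emit):
    """Return the most probable state path for a sequence of symbol indices."""
    k = len(start)
    score = [safe_log(start[i]) + safe_log(emit[i][obs[0]]) for i in range(k)]
    back = []                                   # back-pointers per position
    for o in obs[1:]:
        prev, pointers, score = score, [], []
        for j in range(k):
            # best predecessor state for landing in state j
            best_i = max(range(k), key=lambda i: prev[i] + safe_log(trans[i][j]))
            pointers.append(best_i)
            score.append(prev[best_i] + safe_log(trans[best_i][j])
                         + safe_log(emit[j][o]))
        back.append(pointers)
    # trace back from the best final state
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical parameters: state 0 = turn, state 1 = non-turn.
start = [0.25, 0.75]
trans = [[0.7, 0.3], [0.1, 0.9]]
turn_w = [3.0 if aa in "GNPDS" else 1.0 for aa in AMINO_ACIDS]
emit = [[w / sum(turn_w) for w in turn_w], [1.0 / 20] * 20]

sequence = "MKTAYGNPDGSLLVAGNGK"
obs = [AMINO_ACIDS.index(aa) for aa in sequence]
path = viterbi(obs, start, trans, emit)
labels = "".join("T" if s == 0 else "-" for s in path)   # one label per residue
```

Runs of identical labels in `labels` correspond to the segmentation of the chain into turn and non-turn regions. Working in log space avoids numerical underflow on long chains.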
The probability of each residue in the protein sequence, used to generate the emission matrix, is given by
$$\Pr(r) = \frac{m}{n}$$
where m is the number of occurrences of residue r in the protein sequence and n is the total number of residues in the protein sequence.
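A short sketch of this calculation is given here, reading the estimate as the relative frequency m/n (occurrences of a residue divided by the sequence length), which is an assumption about the intended formula. The function name `residue_frequencies` and the example sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_frequencies(sequence):
    """Relative frequency m/n of each of the 20 residues in one sequence."""
    counts = Counter(sequence)          # m for every residue that occurs
    n = len(sequence)                   # total number of residues
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}

# Hypothetical four-residue example.
freqs = residue_frequencies("GGAP")
```

Residues that never occur receive a frequency of zero; in practice a small pseudocount is often added so that no emission probability is exactly zero, although the text does not say whether that was done here.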
C. Accuracy measures
Once the prediction of β-turns is performed using the hidden Markov model, the problem arises of finding an appropriate measure for the quality of the prediction. Four different scalar measures are used to assess the model's performance [17]. These measures can be derived from four different quantities:
TP (true positive), p, is the number of correctly classified β-turn residues.
TN (true negative), n, is the number of correctly classified non-β-turn residues.
FP (false positive), m, is the number of non-β-turn residues incorrectly classified as β-turn residues.
FN (false negative), o, is the number of β-turn residues incorrectly classified as non-β-turn residues.
The predictive performance of the HMM model can be expressed by the following parameters:
Qtotal gives the percentage of correctly classified residues:
$$Q_{total} = \frac{p + n}{p + n + m + o} \times 100$$
MCC (Matthews Correlation Coefficient) [18] is a measure that accounts for both over- and under-predictions:
$$MCC = \frac{pn - mo}{\sqrt{(p + m)(p + o)(n + m)(n + o)}}$$
Qpredicted is the percentage of β-turn predictions that are correct:
$$Q_{predicted} = \frac{p}{p + m} \times 100$$
Qobserved is the percentage of observed β-turns that are correctly predicted:
$$Q_{observed} = \frac{p}{p + o} \times 100$$
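The four measures can be computed directly from the counts p, n, m and o defined above. The sketch below uses the standard definitions implied by those descriptions; the function name `turn_prediction_scores` and the example counts are hypothetical and do not correspond to any protein in this study.

```python
import math

def turn_prediction_scores(p, n, m, o):
    """Compute Qtotal, Qpredicted, Qobserved (in %) and MCC.

    p: true positives, n: true negatives,
    m: false positives, o: false negatives.
    """
    total = p + n + m + o
    q_total = 100.0 * (p + n) / total
    # Guard against empty denominators (no predicted or no observed turns).
    q_predicted = 100.0 * p / (p + m) if (p + m) else 0.0
    q_observed = 100.0 * p / (p + o) if (p + o) else 0.0
    denom = math.sqrt((p + m) * (p + o) * (n + m) * (n + o))
    mcc = (p * n - m * o) / denom if denom else 0.0
    return {"Qtotal": q_total, "Qpredicted": q_predicted,
            "Qobserved": q_observed, "MCC": mcc}

# Hypothetical counts for a 250-residue chain.
scores = turn_prediction_scores(p=30, n=150, m=50, o=20)
```

Because non-turn residues outnumber turn residues, Qtotal alone can look favourable for a predictor that rarely predicts turns, which is why MCC is reported alongside it.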
IV. Results and discussion
A. Results
The proposed method predicts β-turns using a hidden Markov model.
There are two classes, turns and non-turns, and the model predicts one protein sequence at a time. It has been observed to perform better than some existing prediction methods.
B. Comparison with other methods
In order to evaluate this method, it has been compared with other existing methods, as shown in Table 2.
For now, the comparison is done on a single protein sequence, the one with PDB code 1ah7.
Figure 2 shows the comparison of Qtotal across the different algorithms, Figure 3 the comparison of Qpredicted, Figure 4 the comparison of Qobserved, and Figure 5 the comparison of MCC. The HMM-based method shows better results than some of the existing prediction algorithms.
Figure 2. Comparison of Qtotal with different algorithms
Figure 3. Comparison of Qpredicted with different algorithms
Figure 4. Comparison of Qobserved with different algorithms
Figure 5. Comparison of MCC with different algorithms
Table 2 Comparison with other methods

| Prediction method | Qtotal (in %) | Qpredicted (in %) | Qobserved (in %) | MCC |
|---|---|---|---|---|
| Chou-Fasman algorithm | 56.5 | 26.7 | 76.1 | 0.22 |
| Thornton's algorithm | 62.2 | 30.9 | 82.6 | 0.31 |
| 1-4 & 2-3 correlation model | 53.7 | 24.2 | 69.6 | 0.15 |
| Sequence coupled model | 51.8 | 21.4 | 58.7 | 0.07 |
| GORBTURN | 77.6 | 40.8 | 43.5 | 0.28 |
| HMM based method | 54.2 | 28.5 | 67.3 | 0.27 |
V. Conclusion
In this paper, we presented a way in which an HMM can be used to predict β-turns in a protein chain. Our method predicts the turns and non-turns of a single protein sequence at a time. The results obtained are better than those of some other existing methods. The performance of β-turn prediction can be further improved by considering other techniques, such as using predicted secondary structures and dihedral angles from multiple predictors, using feature selection techniques [19], or combining many features together. Different machine learning techniques can also be combined to improve the performance of the prediction.
References
[1] Chou, Kuo-Chen. “Prediction of tight turns and their types in proteins.” Analytical Biochemistry 286.1 (2000): 1-16.
[2] Chou, P. Y., and G. D. Fasman. “Conformational parameters for amino acids in helical, β-sheet and random coil regions calculated from proteins.” Biochemistry 13 (1974): 211-222.
[3] Venkatachalam, C. M. “Stereochemical criteria for polypeptides and proteins. V. Conformation of a system of three linked peptide units.” Biopolymers 6.10 (1968): 1425-1436.
[4] Richardson, Jane S. “The anatomy and taxonomy of protein structure.” Advances in Protein Chemistry 34 (1981): 167-339.
[5] Chou, P. Y., and G. D. Fasman. “Prediction of beta-turns.” Biophysical Journal 26.3 (1979): 367-383.
[6] Chou, K. C. “Prediction of beta-turns.” Journal of Peptide Research (1997): 120-144.
[7] Chou, Kuo-Chen, and James R. Blinn. “Classification and prediction of β-turn types.” Journal of Protein Chemistry 16.6 (1997): 575-595.
[8] Chou, Kuo-Chen. “Prediction of tight turns and their types in proteins.” Analytical Biochemistry 286.1 (2000): 1-16.
[9] Guruprasad, Kunchur, and Sasidharan Rajkumar. “Beta- and gamma-turns in proteins revisited: a new set of amino acid turn-type dependent positional preferences and potentials.” Journal of Biosciences 25.2 (2000): 143.
[10] Hutchinson, E. Gail, and Janet M. Thornton. “PROMOTIF—a program to identify and analyze structural motifs in proteins.” Protein Science 5.2 (1996): 212-220.
[11] Shepherd, Adrian J., Denise Gorse, and Janet M. Thornton. “Prediction of the location and type of β-turns in proteins using neural networks.” Protein Science 8.5 (1999): 1045-1055.
[12] Wilmot, C. M., and J. M. Thornton. “Analysis and prediction of the different types of β-turn in proteins.” Journal of Molecular Biology 203.1 (1988): 221-232.
[13] Wilmot, C. M., and J. M. Thornton. “β-Turns and their distortions: a proposed new nomenclature.” Protein Engineering 3.6 (1990): 479-493.
[14] Available from: http://imtech.res.in/raghava/betatpred/intro.html
[15] Welch, Lloyd R. “Hidden Markov models and the Baum-Welch algorithm.” IEEE Information Theory Society Newsletter 53.4 (2003): 10-13.
[16] Lou, Hui-Ling. “Implementing the Viterbi algorithm.” IEEE Signal Processing Magazine 12.5 (1995): 42-52.
[17] Fuchs, Patrick F. J., and Alain J. P. Alix. “High accuracy prediction of β-turns and their types using propensities and multiple alignments.” Proteins: Structure, Function, and Bioinformatics 59.4 (2005): 828-839.
[18] Matthews, Brian W. “Comparison of the predicted and observed secondary structure of T4 phage lysozyme.” Biochimica et Biophysica Acta (BBA)-Protein Structure 405.2 (1975): 442-451.
[19] Saeys, Yvan, Inaki Inza, and Pedro Larranaga. “A review of feature selection techniques in bioinformatics.” Bioinformatics 23.19 (2007): 2507-2517.