Prediction of amphipathic in-plane membrane anchors in monotopic proteins using
a SVM classifier.
BMC Bioinformatics. 2006 May 16;7(1):255
Sapay N, Guermeur Y, Deleage G.
BACKGROUND: Membrane proteins are estimated to represent about 25 % of
open reading frames in fully sequenced genomes. However, the experimental study
of proteins remains difficult. Considerable efforts have thus been made to
develop prediction methods. Most of these were conceived to detect transmembrane
helices in polytopic proteins. Alternatively, a membrane protein can be
monotopic and anchored via an amphipathic helix inserted in a parallel way to
the membrane interface, so-called in-plane membrane (IPM) anchors. This type of
membrane anchor is still poorly understood and no suitable prediction method is
currently available. RESULTS: We report here the "AmphipaSeeK" method developed
to predict IPM anchors. It uses a set of 21 reported examples of IPM anchored
proteins. The method is based on a pattern recognition Support Vector Machine
with a dedicated kernel and multiple alignments. CONCLUSIONS: AmphipaSeeK was
shown to be highly specific, in contrast with classically used methods (e.g.
hydrophobic moment). Additionally, it has been able to retrieve IPM anchors in
naively tested sets of transmembrane proteins (e.g. PagP). AmphipaSeeK and the
list of the 21 IPM anchored proteins are available on NPS@, our protein sequence
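
The hydrophobic moment named above as a classical method can be sketched as follows. This is only an illustration of the moment calculation, not the AmphipaSeeK SVM. The function name, the 100-degree helical angle and the choice of the Kyte-Doolittle scale are assumptions made for this example.

```python
import math

# Kyte-Doolittle hydropathy values (J Mol Biol 1982). Assumption: chosen only
# for convenience; the abstract above does not say which scale is used.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment, scale=KD, angle_deg=100.0):
    """Mean hydrophobic moment of a segment.

    Assumes an ideal helix in which consecutive side chains are rotated
    by `angle_deg` degrees (100 degrees for an alpha-helix).
    """
    if not segment:
        raise ValueError("segment must not be empty")
    delta = math.radians(angle_deg)
    # Project each residue's hydrophobicity onto two orthogonal axes.
    sin_sum = sum(scale[aa] * math.sin(i * delta) for i, aa in enumerate(segment))
    cos_sum = sum(scale[aa] * math.cos(i * delta) for i, aa in enumerate(segment))
    return math.hypot(sin_sum, cos_sum) / len(segment)


if __name__ == "__main__":
    # A uniform segment has almost no moment; a Leu/Lys pattern has a large one.
    print(hydrophobic_moment("L" * 18))
    print(hydrophobic_moment("LKKLLKKLLKKLLKKLLK"))
```

A high moment only indicates that hydrophobic residues cluster on one face of a putative helix, which is why it tends to be sensitive but not very specific.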
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 1997 Sep 1;25(17):3389-3402
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Bethesda, MD 20894, USA. email@example.com
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a
variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be
decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word
hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three
times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments
produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific
Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more
sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of
the BRCT superfamily.
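
The idea of turning a set of aligned sequences into a position-specific score matrix can be illustrated with a much simplified sketch. It assumes an ungapped alignment, a uniform background distribution and a flat pseudocount; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts, so the names and parameters below are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds matrix (one dict per column) from an ungapped alignment.

    Assumptions: uniform background frequencies and a flat pseudocount,
    which is far simpler than the scheme used by PSI-BLAST.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum of column scores for a segment as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["ACD", "ACE", "ACD"])
    print(score_segment(pssm, "ACD"), score_segment(pssm, "WWW"))
```

Residues seen often in a column receive positive scores and unseen residues negative ones, so a database segment resembling the alignment scores higher than an unrelated one.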
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap
penalties and weight matrix choice.
Nucleic Acids Res 1994 Nov 11;22(22):4673-4680
Thompson JD, Higgins DG, Gibson TJ
European Molecular Biology Laboratory, Heidelberg, Germany.
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of
divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight
near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different
alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally
reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up
of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W which is freely available.
Predicting coiled coils from protein sequences.
Science 1991 May 24;252(5010):1162-1164
Lupas A, Van Dyke M, Stock J
Department of Molecular Biology, Princeton University, NJ 08544.
The probability that a residue in a protein is part of a coiled-coil structure was assessed by comparison of its flanking sequences
with sequences of known coiled-coil proteins. This method was used to delineate coiled-coil domains in otherwise globular proteins,
such as the leucine zipper domains in transcriptional regulators, and to predict regions of discontinuity within coiled-coil
structures, such as the hinge region in myosin. More than 200 proteins that probably have coiled-coil domains were identified in
GenBank, including alpha- and beta-tubulins, flagellins, G protein beta subunits, some bacterial transfer RNA synthetases, and members
of the heat shock protein (Hsp70) family.
An algorithm for protein secondary structure prediction based on class prediction.
Protein Eng 1987 Aug;1(4):289-294
Deleage G, Roux B
Laboratoire de Physico-Chimie Biologique, LBTM-CNRS UM 24, Universite Claude Bernard, Villeurbanne, France.
An algorithm has been developed to improve the success rate in the prediction of the secondary structure of proteins by taking into
account the predicted class of the proteins. This method has been called the 'double prediction method' and consists of a first
prediction of the secondary structure from a new algorithm which uses parameters of the type described by Chou and Fasman, and the
prediction of the class of the proteins from their amino acid composition. These two independent predictions allow one to optimize
the parameters calculated over the secondary structure database to provide the final prediction of secondary structure. This method
has been tested on 59 proteins in the database (i.e. 10,322 residues) and yields 72% success in class prediction, 61.3% of residues
correctly predicted for three states (helix, sheet and coil) and a good agreement between observed and predicted contents in secondary
Identification and application of the concepts important for accurate and reliable protein secondary structure prediction
Protein Sci 1996 Nov;5(11):2298-310
King RD, Sternberg MJ
Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, London, United Kingdom.
A protein secondary structure prediction method from multiply aligned homologous sequences is presented with an overall
per residue three-state accuracy of 70.1%. There are two aims: to obtain high accuracy by identification of a set of concepts
important for prediction followed by use of linear statistics; and to provide insight into the folding process. The important
concepts in secondary structure prediction are identified as: residue conformational propensities, sequence edge effects,
moments of hydrophobicity, position of insertions and deletions in aligned homologous sequence, moments of conservation,
auto-correlation, residue ratios, secondary structure feedback effects, and filtering. Explicit use of edge effects, moments of
conservation, and auto-correlation are new to this paper. The relative importance of the concepts used in prediction was
analyzed by stepwise addition of information and examination of weights in the discrimination function. The simple and
explicit structure of the prediction allows the method to be reimplemented easily. The accuracy of a prediction is predictable
a priori. This permits evaluation of the utility of the prediction: 10% of the chains predicted were identified correctly as
having a mean accuracy of > 80%. Existing high-accuracy prediction methods are "black-box" predictors based on complex
nonlinear statistics (e.g., neural networks in PHD: Rost & Sander, 1993a). For medium- to short-length chains (> or = 90
residues and < 170 residues), the prediction method is significantly more accurate (P < 0.01) than the PHD algorithm
(probably the most commonly used algorithm). In combination with the PHD, an algorithm is formed that is significantly
more accurate than either method, with an estimated overall three-state accuracy of 72.4%, the highest accuracy reported for
any prediction method.
Dictionary of protein secondary structure : pattern recognition of hydrogen-bonded and geometrical features
Biopolymers 1983, 22: 2577-2637
Kabsch W & Sander C
Searching protein sequence libraries: comparison of the sensitivity and selectivity
of the Smith-Waterman and FASTA algorithms.
Genomics 1991 Nov;11(3):635-650
Pearson WR
Department of Biochemistry, University of Virginia, Charlottesville 22908.
The sensitivity and selectivity of the FASTA and the Smith-Waterman protein sequence comparison algorithms were evaluated using the
superfamily classification provided in the National Biomedical Research Foundation/Protein Identification Resource (PIR) protein
sequence database. Sequences from each of the 34 superfamilies in the PIR database with 20 or more members were compared against the
protein sequence database. The similarity scores of the related and unrelated sequences were determined using either the FASTA program
or the Smith-Waterman local similarity algorithm. These two sets of similarity scores were used to evaluate the ability of the two
comparison algorithms to identify distantly related protein sequences. The FASTA program using the ktup = 2 sensitivity setting
performed as well as the Smith-Waterman algorithm for 19 of the 34 superfamilies. Increasing the sensitivity by setting ktup = 1
allowed FASTA to perform as well as Smith-Waterman on an additional 7 superfamilies. The rigorous Smith-Waterman method performed
better than FASTA with ktup = 1 on 8 superfamilies, including the globins, immunoglobulin variable regions, calmodulins, and
plastocyanins. Several strategies for improving the sensitivity of FASTA were examined. The greatest improvement in sensitivity was
achieved by optimizing a band around the best initial region found for every library sequence. For every superfamily except the
globins and immunoglobulin variable regions, this strategy was as sensitive as a full Smith-Waterman. For some sequences, additional
sensitivity was achieved by including conserved but nonidentical residues in the lookup table used to identify the initial region.
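
For readers unfamiliar with the rigorous method being compared here, a minimal sketch of the Smith-Waterman local similarity score follows. It assumes a simple match/mismatch scheme and a linear gap penalty, which are illustrative parameters and not the scoring matrices or gap penalties used in the study.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Assumptions: identity-based scoring and a linear gap penalty.
    Only the score is returned, not the alignment itself.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACDEFG", "XXCDEYY"))
```

The full dynamic-programming matrix is what makes the method rigorous but slow; FASTA restricts the computation to regions found through a ktup word lookup.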
Improved tools for biological sequence comparison.
Proc Natl Acad Sci USA 1988 Apr;85(8):2444-2448
Pearson WR, Lipman DJ
Department of Biochemistry, University of Virginia, Charlottesville 22908.
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data
bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more
sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein
sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the
calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of
related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that
preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with
scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be
displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow
comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
A flexible motif search technique based on generalized profile
Comput Chem 1996 Mar;20(1):3-23
Bucher P, Karplus K, Moeri N, Hofmann K
Swiss Institute for Experimental Cancer Research, Epalinges, Switzerland.
A flexible motif search technique is presented which has two major components: (1) a generalized profile syntax serving as a motif
definition language; and (2) a motif search method specifically adapted to the problem of finding multiple instances of a motif in
the same sequence. The new profile structure, which is the core of the generalized profile syntax, combines the functions of a variety
of motif descriptors implemented in other methods, including regular expression-like patterns, weight matrices, previously used profiles,
and certain types of hidden Markov models (HMMs). The relationship between generalized profiles and other biomolecular motif descriptors
is analyzed in detail, with special attention to HMMs. Generalized profiles are shown to be equivalent to a particular class of HMMs,
and conversion procedures in both directions are given. The conversion procedures provide an interpretation for local alignment in the
framework of stochastic models, allowing for clear, simple significance tests. A mathematical statement of the motif search problem defines
the new method exactly without linking it to a specific algorithmic solution. Part of the definition includes a new definition of disjointness
Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins.
J Mol Biol 1978 Mar 25;120(1):97-120
Garnier J, Osguthorpe DJ, Robson B
1) Co-operation between a laboratory interested in developing the theory for protein secondary structure prediction methods and a
laboratory interested in applying and comparing such methods has led to the development of a simple predictive algorithm.
2) Four-state predictions, in which each residue is unambiguously assigned one conformational state of alpha-helix, extended chain,
reverse turn or coil, predict 49% of residue states correctly (in a sample of 26 proteins) when the overall helix and extended-chain
content is not taken into account.
3) When the relative abundances of helix, extended chain, reverse turn and coil observed by X-ray crystallography are taken into
account, a single constant for each protein and type of conformation can be used to bias the prediction. When predictions are
optimized in this way, 63% of all residue states are unambiguously and correctly assigned.
4) By analysing the nature of the bias required, proteins can be classified into helix-rich types, pleated-sheet-rich types, and so
on. It is shown that, if the type of protein can be determined even approximately by circular dichroism, 57% of residue states can be
correctly predicted without taking into account the X-ray structure. Further, comparable predictions can be obtained if, instead of
circular dichroism, preliminary predictions are made to assess the protein type.
5) It is emphasized that the numbers quoted here depend on the method used to assess accuracy, and the algorithm is shown to be at
least as good as, and usually superior to, the reported prediction methods assessed in the same way.
6) Ways of further enhancing predictions by the use of additional information from hydrophobic triplets and homologous sequences are
also explored. Hydrophobic triplet information does not significantly improve predictive power and it is concluded that this
information is used by proteins in the next stage of folding. On the other hand, the use of homologous sequences appears to be very
7) The implication of these results in protein folding is discussed.
Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs.
J Mol Biol 1987 Dec 5;198(3):425-443
Gibrat JF, Garnier J, Robson B
Laboratoire de Biochimie-Physique, INRA, Universite de Paris-Sud, Orsay, France.
We have re-evaluated the information used in the Garnier-Osguthorpe-Robson (GOR) method of secondary structure prediction with the
currently available database. The framework of information theory provides a means to formulate the influence of local sequence upon
the conformation of a given residue, in a rigorous manner. However, the existing database does not allow the evaluation of parameters
required for an exact treatment of the problem. The validity of the approximations drawn from the theory is examined. It is shown that
the first-level approximation, involving single-residue parameters, is only marginally improved by an increase in the database. The
second-level approximation, involving pairs of residues, provides a better model. However, in this case the database is not big enough
and this method might lead to parameters with deficiencies. Attention is therefore given to overcoming this lack of data. We have
determined the significant pairs and the number of dummy observations necessary to obtain the best result for the prediction. This new
version of the GOR method increases the accuracy of prediction by 7%, bringing the amount of residues correctly predicted to 63% for
three states and 68 proteins, each protein to be predicted being removed from the database and the parameters derived from the other
proteins. If the protein to be predicted is kept in the database the accuracy goes up to 69.7%.
GOR secondary structure prediction method version IV
Methods in Enzymology 1996 R.F. Doolittle Ed., vol 266, 540-553
Garnier J, Gibrat J-F, Robson B
GOR: The GOR method is based on information theory and was developed by J. Garnier, D. Osguthorpe and B. Robson (J. Mol. Biol. 120, 97, 1978).
The present version, GOR IV, uses all possible pair frequencies within a window of 17 amino acid residues and is reported by J.
Garnier, J.F. Gibrat and B. Robson in Methods in Enzymology, vol 266, p 540-553 (1996). After crossvalidation on a data base of 267
proteins, the version IV of GOR has a mean accuracy of 64.4% for a three state prediction (Q3). The program gives two outputs, one
eye-friendly giving the sequence and the predicted secondary structure in rows, H=helix, E=extended or beta strand and C=coil; the
second gives the probability values for each secondary structure at each amino acid position. The predicted secondary structure is
the one of highest probability compatible with a predicted helix segment of at least four residues and a predicted extended segment
of at least two residues.
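
The three-state accuracy (Q3) quoted above is the percentage of residues whose predicted state matches the observed one. A minimal sketch follows, assuming both structures are given as equal-length strings over H, E and C; the function name is hypothetical and this is not part of the GOR IV program.

```python
def q3(observed, predicted):
    """Percentage of residues with identical observed and predicted state.

    Assumption: both arguments are strings of equal length using the
    letters H (helix), E (extended) and C (coil).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    # One mismatch out of ten residues.
    print(q3("HHHHEEEECC", "HHHCEEEECC"))
```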
Profile hidden Markov models
Department of Genetics, Washington University School of Medicine, St Louis, USA.
The recent literature on profile hidden Markov model (profile HMM) methods and software is reviewed. Profile HMMs turn a multiple
sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences.
Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis. Several software implementations
and two large libraries of profile HMMs of common protein domains are available. HMM methods performed comparably to threading methods in
the CASP2 structure prediction exercise.
Combinaison de classifieurs statistiques, Application a la prediction de structure secondaire des proteines [Combination of statistical classifiers: application to protein secondary structure prediction]
Model combination has recently been at the origin of significant improvements in the field of statistical learning, both for
regression and pattern recognition tasks. However, fundamental questions have remained virtually untackled. Few criteria have thus
been developed to motivate the choice of a specific method, whereas no independent result has been derived in the field of
discrimination. This dissertation deals with one of the most commonly used combination techniques: linear regression. We first
characterize the regularizing effect of the "stacked regression" method introduced by Breiman. We then study the application of the
multivariate linear regression model to the combination of discriminant experts the outputs of which are estimates of the class
posterior probabilities. This question is successively considered from the point of view of optimization and complexity control.
The latter point involves the computation of generalized Vapnik-Chervonenkis dimensions. The study is followed up with the description
of a nonparametric method for Bayes' error rate estimation. Our ensemble method is assessed on an open biological sequence processing
problem: the problem of globular protein secondary structure prediction. To perform this discrimination task, we introduce a
hierarchical and modular approach in which combination is used at an intermediate level.
Helix-turn-helix DNA-binding motifs prediction
Improved detection of helix-turn-helix DNA-binding motifs in protein sequences.
Nucleic Acids Res 1990 Sep 11;18(17):5019-5026
Dodd IB, Egan JB
Department of Biochemistry, University of Adelaide, Australia.
We present an update of our method for systematic detection and evaluation of potential helix-turn-helix DNA-binding motifs in
protein sequences [Dodd, I. and Egan, J. B. (1987) J. Mol. Biol. 194, 557-564]. The new method is considerably more powerful,
detecting approximately 50% more likely helix-turn-helix sequences without an increase in false predictions. This improvement is due
almost entirely to the use of a much larger reference set of 91 presumed helix-turn-helix sequences. The scoring matrix derived from
this reference set has been calibrated against a large protein sequence database so that the score obtained by a sequence can be used
to give a practical estimation of the probability that the sequence is a helix-turn-helix motif.
Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination
Bioinformatics vol. 15 no. 5 1999 pp 413-421
Guermeur Y, Geourjon C, Gallinari P, & Deleage G
In many fields of pattern recognition, combination has proved efficient to increase the generalization performance of individual
prediction methods. Numerous systems have been developed for protein secondary structure prediction, based on different principles.
Finding better ensemble methods for this task may thus become crucial. In addition, efforts need to be made to help the biologist in
the post-processing of the outputs.
An ensemble method has been designed to post-process the outputs of protein secondary structure prediction methods, in order to obtain
an improvement of prediction accuracy while generating class posterior probability estimates. Experimental results establish that it
can increase the recognition rate of methods that provide inhomogeneous scores, even if their individual prediction successes are
largely different. This combination thus constitutes a help for the biologist, who can use it confidently on top of any set of
prediction methods. Furthermore, the resulting estimates can be used in various ways, for instance to determine which residues are
predicted with a given high level of reliability.
Free availability over the internet on the Network Protein Sequence @nalysis (NPS@) WWW server at https://npsa-prabi.ibcp.fr/NPSA/npsa_mlrc.html. The method is proposed as the default choice.
Neural networks and ensemble method : Yann.Guermeur@ens-lyon.fr, server and software : firstname.lastname@example.org.
MPSA: integrated system for multiple protein sequence analysis with client/server capabilities.
Bioinformatics 2000 Mar;16(3):286-7
Blanchet C, Combet C, Geourjon C, Deleage G
Summary: MPSA is stand-alone software intended for protein sequence analysis with a high integration level and Web client/server
capabilities. It provides many methods and tools, which are integrated into an interactive graphical user interface. It is available for
most Unix/Linux and non-Unix systems. MPSA is able to connect to a Web server (e.g. http://mpsa-pbil.ibcp.fr/) in order to perform
large-scale sequence comparison on up-to-date databanks. Availability: Free to academic http://mpsa-pbil.ibcp.fr/ Contact:
Multiple sequence alignment with hierarchical clustering.
Nucleic Acids Res 1988 Nov 25;16(22):10881-10890
Laboratoire de Genetique Cellulaire, INRA Toulouse, France.
An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to
use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a
hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are
aligned creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise
alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different
from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39
sequences of cytochrome c.
NPS@: Network Protein Sequence Analysis
TIBS 2000 March Vol. 25, No 3 :147-150
Combet C., Blanchet C., Geourjon C. and Deléage G.
P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins.
Comput Appl Biosci 1997 Jun;13(3):291-5
Labesse G, Colloc'h N, Pothier J, Mornon JP
MOTIVATION: The secondary structure is a key element of architectural organization in proteins. Accurate assignment of the secondary
structure elements (SSE) (helix, strand, coil) is an essential step for the analysis and modelling of protein structure. Various
methods have been proposed to assign secondary structure. Comparative studies of their results have shown some of their drawbacks,
pointing out the difficulties in the task of SSE assignment.
RESULTS: We have designed a new automatic method, named P-SEA, to assign efficiently secondary structure from the sole C alpha
position. Some advantages of the new algorithm are discussed.
AVAILABILITY: The program P-SEA is available by anonymous ftp: ftp.lmcp.jussieu.fr directory: pub/.
Prediction of protein secondary structure at better than 70% accuracy.
J Mol Biol 1993 Jul 20;232(2):584-99
Rost B, Sander C
European Molecular Biology Laboratory, Heidelberg, Germany.
We have trained a two-layered feed-forward neural network on a non-redundant data base of 130 protein chains to predict
the secondary structure of water-soluble proteins. A new key aspect is the use of evolutionary information in the form of
multiple sequence alignments that are used as input in place of single sequences. The inclusion of protein family information
in this form increases the prediction accuracy by six to eight percentage points. A combination of three levels of networks
results in an overall three-state accuracy of 70.8% for globular proteins (sustained performance). If four membrane protein
chains are included in the evaluation, the overall accuracy drops to 70.2%. The prediction is well balanced between
alpha-helix, beta-strand and loop: 65% of the observed strand residues are predicted correctly. The accuracy in predicting
the content of three secondary structure types is comparable to that of circular dichroism spectroscopy. The performance
accuracy is verified by a sevenfold cross-validation test, and an additional test on 26 recently solved proteins. Of particular
practical importance is the definition of a position-specific reliability index. For half of the residues predicted with a high
level of reliability the overall accuracy increases to better than 82%. A further strength of the method is the more realistic
prediction of segment length. The protein family prediction method is available for testing by academic researchers via an
electronic mail server.
Combining evolutionary information and neural networks to predict protein secondary structure.
Proteins 1994 May;19(1):55-72
Rost B, Sander C
European Molecular Biology Laboratory, Heidelberg, Germany.
Using evolutionary information contained in multiple sequence alignments as input to neural networks, secondary structure
can be predicted at significantly increased accuracy. Here, we extend our previous three-level system of neural networks by
using additional input information derived from multiple alignments. Using a position-specific conservation weight as part
of the input increases performance. Using the number of insertions and deletions reduces the tendency for overprediction and
increases overall accuracy. Addition of the global amino acid content yields a further improvement, mainly in predicting
structural class. The final network system has sustained overall accuracy of 71.6% in a multiple cross-validation test on 126
unique protein chains. A test on a new set of 124 recently solved protein structures that have no significant sequence
similarity to the learning set confirms the high level of accuracy. The average cross-validated accuracy for all 250
sequence-unique chains is above 72%. Using various data sets, the method is compared to alternative prediction methods,
some of which also use multiple alignments: the performance advantage of the network system is at least 6 percentage points
in three-state accuracy. In addition, the network estimates secondary structure content from multiple sequence alignments
about as well as circular dichroism spectroscopy on a single protein and classifies 75% of the 250 proteins correctly into one
of four protein structural classes. Of particular practical importance is the definition of a position-specific reliability index.
For 40% of all residues the method has a sustained three-state accuracy of 88%, as high as the overall average for homology
modelling. A further strength of the method is greatly increased accuracy in predicting the placement of secondary structure
A computer program for predicting protein antigenic determinants.
Mol Immunol 1983 Apr;20(4):483-489
Hopp TP, Woods KR
A computerized method for predicting the locations of protein antigenic determinants is presented, which requires only the amino acid
sequence of a protein, and no other information. This procedure has been used to predict the major antigenic determinant of the
hepatitis B surface antigen, as well as antigenic sites on a series of test proteins of known antigenic structure [Hopp & Woods (1981)
Proc. Nat. Acad. Sci. U.S.A. 78, 3824-3828.] The method is suitable for use in smaller personal computers, and is written in the BASIC
language, in order to make it available to investigators with limited computer experience and/or resources. A means of locating
multiple antigenic sites on a homologous series of proteins is demonstrated using the influenza hemagglutinin as an example.
A simple method for displaying the hydropathic character of a protein.
J Mol Biol 1982 May 5;157(1):105-132
Kyte J, Doolittle RF
A computer program that progressively evaluates the hydrophilicity and hydrophobicity of a protein along its amino acid sequence has
been devised. For this purpose, a hydropathy scale has been composed wherein the hydrophilic and hydrophobic properties of each of the
20 amino acid side-chains is taken into consideration. The scale is based on an amalgam of experimental observations derived from the
literature. The program uses a moving-segment approach that continuously determines the average hydropathy within a segment of
predetermined length as it advances through the sequence. The consecutive scores are plotted from the amino to the carboxy terminus.
At the same time, a midpoint line is printed that corresponds to the grand average of the hydropathy of the amino acid compositions
found in most of the sequenced proteins. In the case of soluble, globular proteins there is a remarkable correspondence between the
interior portions of their sequence and the regions appearing on the hydrophobic side of the midpoint line, as well as the exterior
portions and the regions on the hydrophilic side. The correlation was demonstrated by comparisons between the plotted values and known
structures determined by crystallography. In the case of membrane-bound proteins, the portions of their sequences that are located
within the lipid bilayer are also clearly delineated by large uninterrupted areas on the hydrophobic side of the midpoint line. As
such, the membrane-spanning segments of these proteins can be identified by this procedure. Although the method is not unique and
embodies principles that have long been appreciated, its simplicity and its graphic nature make it a very useful tool for the
evaluation of protein structures.
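The moving-segment averaging described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' original program: the function name hydropathy_profile and the default window of 9 residues are assumptions made for the example, while the per-residue values are the published Kyte-Doolittle hydropathy indices.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the average hydropathy of every full window along the sequence.

    The window length is a free parameter here (hypothetical default of 9);
    the result has len(sequence) - window + 1 values, ordered from the
    amino terminus to the carboxy terminus.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in seq]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    # Hydrophobic stretch followed by a charged stretch.
    for position, value in enumerate(hydropathy_profile("IIIKKK", window=3)):
        print(position, round(value, 2))
```

Plotting these values against window position gives the kind of profile the abstract describes; long runs of high values are the candidate membrane-spanning segments.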
Prediction of chain flexibility in proteins
Naturwissenschaften (1985), 72, 212-213
Karplus, P.A. & Schulz, G.E
No summary available yet
New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted
surface residues with antigenicity and X-ray-derived accessible sites.
Biochemistry 1986 Sep 23;25(19):5425-5432
Parker JM, Guo D, Hodges RS
A new set of hydrophilicity high-performance liquid chromatography (HPLC) parameters is presented. These parameters were derived from
the retention times of 20 model synthetic peptides, Ac-Gly-X-X-(Leu)3-(Lys)2-amide, where X was substituted with the 20 amino acids
found in proteins. Since hydrophilicity parameters have been used extensively in algorithms to predict which amino acid residues are
antigenic, we have compared the profiles generated by our new set of hydrophilic HPLC parameters on the same scale as nine other sets
of parameters. Generally, it was found that the HPLC parameters obtained in this study correlated best with antigenicity. In addition,
it was shown that a combination of the three best parameters for predicting antigenicity further improved the predictions. These
predicted surface sites or, in other words, the hydrophilic, accessible, or mobile regions were then correlated to the known antigenic
sites from immunological studies and accessible sites determined by X-ray crystallographic data for several proteins.
Structural prediction of membrane-bound proteins.
Eur J Biochem 1982 Nov 15;128(2-3):565-575
Argos P, Rao JK, Hargrave PA
A prediction algorithm based on physical characteristics of the twenty amino acids and refined by comparison to the proposed
bacteriorhodopsin structure was devised to delineate likely membrane-buried regions in the primary sequences of proteins known to
interact with the lipid bilayer. Application of the method to the sequence of the carboxyl terminal one-third of bovine rhodopsin
predicted a membrane-buried helical hairpin structure. With the use of lipid-buried segments in bacteriorhodopsin as well as regions
predicted by the algorithm in other membrane-bound proteins, a hierarchical ranking of the twenty amino acids in their preferences to
be in lipid contact was calculated. A helical wheel analysis of the predicted regions suggests which helical faces are within the
protein interior and which are in contact with the lipid bilayer.
Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence.
Protein Eng 1996 Feb;9(2):133-142
Frishman D, Argos P
European Molecular Biology Laboratory, Heidelberg, Germany.
Existing approaches to protein secondary structure prediction from the amino acid sequence usually rely on the statistics of local
residue interactions within a sliding window and the secondary structural state of the central residue. The practically achieved
accuracy limit of such single residue and single sequence prediction methods is 65% in three structural stages (alpha-helix,
beta-strand and coil). Further improvement in the prediction quality is likely to require exploitation of various aspects of
three-dimensional protein architecture. Here we make such an attempt and present an accurate algorithm for secondary structure
prediction based on recognition of potentially hydrogen-bonded residues in a single amino acid sequence. The unique feature of our
approach involves database-derived statistics on residue type occurrences in different classes of beta-bridges to delineate
interacting beta-strands. The alpha-helical structures are also recognized on the basis of amino acid occurrences in hydrogen-bonded
pairs (i,i + 4). The algorithm has a prediction accuracy of 68% in three structural stages, relies only on a single protein sequence
as input and has the potential to be improved by 5-7% if homologous aligned sequences are also considered.
Secondary consensus prediction
Protein structure prediction. Implications for the biologist.
Biochimie 1997 Nov;79(11):681-686
Deleage G, Blanchet C, Geourjon C
Institute of Biology and Chemistry of Proteins, Lyon, France.
Recent improvements in the prediction of protein secondary structure are described, particularly those methods using the information
contained in multiple alignments. In this respect, the prediction accuracy has been checked and methods that take into account
multiple alignments are 70% correct for a three-state description of secondary structure. This quality is obtained by a
'leave-one-out' procedure on a reference database of proteins sharing less than 25% identity. Biological applications such as 'protein domain
design' and structural phylogeny are given. The biologist's point of view is also considered and joint predictions are encouraged in
order to derive an amino acid based accuracy. All the tools described in this paper are available for biologists on the Web
An algorithm for secondary structure determination in proteins based on sequence similarity.
FEBS Lett 1986 Sep 15;205(2):303-308
Levin JM, Robson B, Garnier J
A secondary structure prediction algorithm is proposed on the hypothesis that short homologous sequences of amino acids have the same
secondary structure tendencies. Comparisons are made with the secondary structure assignments of Kabsch and Sander from X-ray data
[(1983) Biopolymers 22, 2577-2637] and an empirically determined similarity matrix which assigns a sequence similarity score between
any two sequences of 7 residues in length. This similarity matrix differs in many respects from that of the Dayhoff substitution
matrix [(1978) in: Atlas of Protein Sequence and Structure, (Dayhoff, M.O. ed). vol. 5. suppl. 3, pp. 353-358, National Biochemical
Research Foundation, Washington, DC]. This homologue method had a prediction accuracy of 62.2% over 3 states for 61 proteins and 63.6%
for a new set of 7 proteins not in the original data base.
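The idea behind the homologue method above, that similar short peptides tend to share secondary structure, can be illustrated with a toy nearest-neighbour predictor. This is only a sketch under stated assumptions: the identity-count similarity stands in for the empirically determined similarity matrix, and the threshold of 4, the default state "C" and all function names are hypothetical choices made for the example, not values from the paper.

```python
from collections import defaultdict


def similarity(peptide_a, peptide_b):
    """Toy score: number of identical positions (stand-in for a real matrix)."""
    return sum(1 for a, b in zip(peptide_a, peptide_b) if a == b)


def predict_nearest_neighbour(query, database, window=7, threshold=4, default="C"):
    """Predict one state per query residue from similar database peptides.

    database is a list of (sequence, states) pairs of equal length, where
    states uses one letter per residue (for example H, E, C). Every pair of
    windows scoring at least the threshold adds its score to the observed
    state of each aligned residue; the best-scoring state wins.
    """
    votes = [defaultdict(int) for _ in query]
    for start in range(len(query) - window + 1):
        query_peptide = query[start:start + window]
        for sequence, states in database:
            for offset in range(len(sequence) - window + 1):
                score = similarity(query_peptide, sequence[offset:offset + window])
                if score >= threshold:
                    for k in range(window):
                        votes[start + k][states[offset + k]] += score
    # Residues with no similar database peptide fall back to the default state.
    return "".join(max(v, key=v.get) if v else default for v in votes)


if __name__ == "__main__":
    toy_database = [("AAAAAAA", "HHHHHHH")]
    print(predict_nearest_neighbour("AAAAAAA", toy_database))
```

A real implementation would replace similarity with a substitution-matrix lookup and tune the threshold by cross-validation.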
Exploring the limits of nearest neighbour secondary structure prediction.
Protein Eng. (1997),7, 771-776
SIMPA is a nearest neighbour method for predicting secondary structures using a similarity matrix (BLOSUM 62 in its latest version),
an optimized similarity threshold, a window of 13 to 17 residues and a database of observed secondary structures. In the simpa96
version used here, the database contains circa 300 proteins and the window is 13 residues long. Its cross-validated accuracy was a Q3
of 67.7% for a single sequence and 72.8% when using multiple alignments of homologous sequences.
- J. LEVIN, B. ROBSON, J. GARNIER. An Algorithm for secondary structure determination in proteins based on sequence similarity.
FEBS, 205, (1986) 303-308. This describes the basic algorithm.
- J. LEVIN, J. GARNIER. Improvements in a secondary structure prediction method based on a search for local sequence homologies and
its use as a model building tool. Biochim. Biophys. Acta, (1988) 955, 283-295. Here the window and threshold are optimized and the
results are cross-validated by a jackknife procedure.
- J. LEVIN. Exploring the limits of nearest neighbour secondary structure prediction. Protein Eng. (1997),7, 771-776 This
corresponds to simpa96.
SOPM: a self-optimized method for protein secondary structure prediction.
Protein Eng 1994 Feb;7(2):157-164
Geourjon C, Deleage G
Institut de Biologie et de Chimie des Proteines, UPR 412-CNRS, Lyon, France.
A new method called the self-optimized prediction method (SOPM) has been developed to improve the success rate in the prediction of
the secondary structure of proteins. This new method has been checked against an updated release of the Kabsch and Sander database,
'DATABASE.DSSP', comprising 239 protein chains. The first step of the SOPM is to build sub-databases of protein sequences and their
known secondary structures drawn from 'DATABASE.DSSP' by (i) making binary comparisons of all protein sequences and (ii) taking into
account the prediction of structural classes of proteins. The second step is to submit each protein of the sub-database to a secondary
structure prediction using a predictive algorithm based on sequence similarity. The third step is to iteratively determine the
predictive parameters that optimize the prediction quality on the whole sub-database. The last step is to apply the final parameters
to the query sequence. This new method correctly predicts 69% of amino acids for a three-state description of the secondary structure
(alpha helix, beta sheet and coil) in the whole database (46,011 amino acids). The correlation coefficients are C alpha = 0.54, C beta
= 0.50 and Cc = 0.48. Root mean square deviations of 10% in the secondary structure content are obtained. Implications for the users
are drawn so as to derive an accuracy at the amino acid level and provide the user with a guide for secondary structure prediction.
The SOPM method is available by anonymous ftp to ibcp.fr.
SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments.
Comput Appl Biosci 1995 Dec;11(6):681-684
Geourjon C, Deleage G
Institut de Biologie et de Chimie des Proteines, UPR 412-CNRS, Lyon, France.
Recently a new method called the self-optimized prediction method (SOPM) has been described to improve the success rate in the
prediction of the secondary structure of proteins. In this paper we report improvements brought about by predicting all the sequences
of a set of aligned proteins belonging to the same family. This improved SOPM method (SOPMA) correctly predicts 69.5% of amino acids
for a three-state description of the secondary structure (alpha-helix, beta-sheet and coil) in a whole database containing 126 chains
of non-homologous (less than 25% identity) proteins. Joint prediction with SOPMA and a neural networks method (PHD) correctly predicts
82.2% of residues for 74% of co-predicted amino acids. Predictions are available by Email to email@example.com or on a Web page
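The two kinds of figures quoted in these abstracts, a per-residue three-state accuracy and a joint-prediction accuracy restricted to co-predicted residues, can be computed as in the sketch below. The function names and the H/E/C state letters are assumptions made for illustration; this is not the SOPMA or PHD code.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def joint_accuracy(predicted_a, predicted_b, observed):
    """Return (coverage, accuracy) for residues on which two methods agree.

    coverage is the fraction of residues given the same state by both
    methods; accuracy is the fraction of those agreed residues that match
    the observed state (0.0 when the methods never agree).
    """
    if not (len(predicted_a) == len(predicted_b) == len(observed)) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    agreed = [(a, o) for a, b, o in zip(predicted_a, predicted_b, observed) if a == b]
    coverage = len(agreed) / len(observed)
    accuracy = sum(a == o for a, o in agreed) / len(agreed) if agreed else 0.0
    return coverage, accuracy


if __name__ == "__main__":
    print(q3("HHEC", "HHCC"))
    print(joint_accuracy("HHEC", "HHCC", "HECC"))
```

Restricting the evaluation to agreed residues trades coverage for accuracy, which is why a joint figure is reported together with the fraction of co-predicted amino acids.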
Identification of common molecular subsequences.
J. Mol. Biol. (1981) 147:195-197
Smith TF, Waterman MS
No summary available yet
Knowledge-based secondary structure assignment
Proteins: structure, function and genetics (1995), 23, 566-579
Frishman D & Argos P
Transmembrane helices prediction
Transmembrane helices predicted at 95% accuracy.
Protein Sci 1995 Mar;4(3):521-33
Rost B, Casadio R, Fariselli P, Sander C
Protein Design Group, EMBL Heidelberg, Germany.
We describe a neural network system that predicts the locations of transmembrane helices in integral membrane proteins. By using
evolutionary information as input to the network system, the method significantly improved on a previously published neural network
prediction method that had been based on single sequence information. The input data were derived from multiple alignments for each
position in a window of 13 adjacent residues: amino acid frequency, conservation weights, number of insertions and deletions, and
position of the window with respect to the ends of the protein chain. Additional input was the amino acid composition and length of the
whole protein. A rigorous cross-validation test on 69 proteins with experimentally determined locations of transmembrane segments
yielded an overall two-state per-residue accuracy of 95%. About 94% of all segments were predicted correctly. When applied to known
globular proteins as a negative control, the network system incorrectly predicted fewer than 5% of globular proteins as having
transmembrane helices. The method was applied to all 269 open reading frames from the complete yeast VIII chromosome. For 59 of these,
at least two transmembrane helices were predicted. Thus, the prediction is that about one-fourth of all proteins from yeast VIII contain
one transmembrane helix, and some 20%, more than one.
The PROSITE database, its status in 1997.
Nucleic Acids Res 1997 Jan 1;25(1):217-221
Bairoch A, Bucher P, Hofmann K
Department of Medical Biochemistry, University of Geneva, 1 rue Michel Servet 1211 Geneva 4, Switzerland. firstname.lastname@example.org
The PROSITE database consists of biologically significant patterns and profiles formulated in such a way that with appropriate
computational tools it can help to determine to which known family of protein (if any) a new sequence belongs, or which known
domain(s) it contains.
The SWISS-PROT protein sequence data bank and its supplement TrEMBL.
Nucleic Acids Res 1997 Jan 1;25(1):31-36
Bairoch A, Apweiler R
Department of Medical Biochemistry, University of Geneva, 1 rue Michel Servet, 1211 Geneva 4, Switzerland. email@example.com
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the
function of a protein, structure of its domains, post-translational modifications, variants, etc.), a minimal level of redundancy and
high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of
model organisms; cross-references to two additional databases; a variety of new documentation files and the creation of TrEMBL, a
computer annotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the
translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT.
Identification of related proteins with weak sequence identity using secondary structure information.
Protein Sci 2001 Apr;10(4):788-97
Geourjon C, Combet C, Blanchet C, Deleage G
Molecular modeling of proteins is confronted with the problem of finding homologous proteins, especially when few identities remain
after the process of molecular evolution. Using even the most recent methods based on sequence identity detection, structural
relationships are still difficult to establish with high reliability. As protein structures are more conserved than sequences, we
investigated the possibility of using protein secondary structure comparison (observed or predicted structures) to discriminate between
related and unrelated protein sequences in the range of 10%-30% sequence identity. Pairwise comparisons of secondary structures have
been measured using the structural overlap (Sov) parameter. In this article, we show that if the secondary structure likeness is >50%,
most of the pairs are structurally related. Taking into account the secondary structures of proteins that have been detected by BLAST,
FASTA, or SSEARCH in the noisy region (with high E-value), we show that distantly related protein sequences (even with <20% identity)
can still be identified. This strategy can be used to identify three-dimensional templates in homology modeling by finding unexpected
related proteins and to select proteins for experimental investigation in a structural genomic approach, as well as for genome
annotation.