Bioinformatics Group, Swiss Institute for Experimental Cancer Research, CH-1066 Epalinges s/Lausanne, Switzerland.
(INTERNET: khofmann@isrec-sun1.unil.ch ; pbucher@isrec-sun1.unil.ch)
The PROSITE pattern library has proved to be a useful tool for the detection of motifs in protein sequences. Both consensus sequences for posttranslational modifications and `signatures' characteristic of certain protein families are documented and stored in a searchable format. However, not all characterised protein motifs and domains can be described using the PROSITE syntax, which is based on regular expressions. Recently, we started to add profile entries to PROSITE, using the `generalised profile syntax', which had been designed for that purpose. Profiles are two-dimensional tables of position-specific match, gap, and insertion scores, normally derived from aligned sequence families. Here we describe the methods we use for the generation of a profile entry for the PROSITE database.
The standard procedure starts with a set of sequences whose membership in the protein- or domain-family under study is well established. The sequences are aligned using any of the available multiple-alignment methods. It has proved very advantageous to include information on the 3D-structure of one or more family members in the alignment process, if it is available. In many cases the alignment can be significantly improved by manual editing. Subsequently, a profile is constructed from the multiple alignment, using sequence weighting to account for subfamily bias, and limited gap-excision. The profile is first compared with every sequence in a randomised protein database, derived from SwissProt by regional shuffling, in order to analyse the score statistics and to calculate the necessary scaling parameters. Finally, the profile is used to search a nonredundant database of all available protein sequences. The scaling parameters are used to estimate the significance of an observed score; database sequences with scores exceeding a certain confidence threshold are tentatively accepted as members of the sequence family. After checking for biological plausibility, this newly derived set of sequences is used for the construction of a second profile, essentially as described above. The whole iterative process is continued until no new sequences fulfil the imposed significance criteria. For the inclusion of the profile into PROSITE, appropriate cut-off scores are derived from the statistical parameters and the relevant sequences in SwissProt are classified into the categories `true positive', `false positive', `false negative', and `sequence fragment'.
While the methods described above are used for the construction of most profiles, in some cases we use the versatility of the generalised profile syntax to convert other motif descriptions into the profile format. We developed programs to import `classical' profiles as introduced by Gribskov and used in the GCG package; these profiles use only a subset of the generalised profile syntax. We also use the formal equivalence of profiles and Hidden Markov Models (HMMs) to interconvert these two formats and apply the profile search methods to models derived from HMM training procedures.
Up to now we have constructed about 40 different profiles for frequent and important protein domains. With the recent improvements in the profile construction methods, we expect this number to increase relatively fast. Nevertheless, we intend to prepare a limited number of high-quality profiles representing defined domains and protein families rather than a large number of automatically created profiles of poor quality.
The construction of each new profile entry begins with a set of sequences, either total proteins or local homology domains, which can be assumed to belong to the same family. It is important not to include sequences with doubtful relationship to the family under consideration since even a single inappropriate sequence can severely degrade profile performance. Criteria we accept for establishing the relationship between the sequences of the starting set include the following:
Several methods can be used to align the proteins of the trusted set, depending on the amount of available information. If the 3D-structure of more than one sequence of the family is known, we usually start with a structural alignment of these sequences, derived from a superposition of the structures. If no structural data are available, we start with a multiple alignment generated by programs like ClustalW or Pileup. In most cases that include divergent proteins, a manual refinement of the initial multiple alignment is necessary. If some of the sequences are very divergent, it has proven advantageous to exclude these sequences from the initial alignment and add them at a later stage to the already existing multiple alignment.
As has been shown before by us[1] and others[2], the introduction of sequence weights improves the performance of the resulting profiles. This effect is particularly pronounced if the initial set of trusted sequences contains both unique sequences and multiple members from closely related sequence families. We usually perform the weighting on the pre-aligned sequences, using an algorithm introduced by Sibbald and Argos[3]. Briefly, this algorithm constructs random sequences from the repertoire of the original sequences and tests which of the proteins in the set is most closely related to the random sequence. After about 2000*N such trials, the number of hits that each of the N sequences in the initial set has accumulated is counted. The weighting factor is then derived from these counts.
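The weighting idea outlined above can be illustrated with a minimal Python sketch. It is not the published implementation: the function name `monte_carlo_weights`, the uniform draw from the distinct residues observed in each column, the mismatch-count distance, and the equal splitting of ties are all assumptions made for illustration. Only the 2000*N trial count is taken from the text.

```python
import random

def monte_carlo_weights(alignment, trials_per_seq=2000, seed=0):
    """Estimate sequence weights for a pre-aligned set of sequences.

    alignment: list of equal-length strings (rows of the alignment).
    Returns a list of weights that sum to 1.
    """
    rng = random.Random(seed)
    n = len(alignment)
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Repertoire of distinct symbols observed in each column (assumption:
    # every observed symbol is drawn with equal probability).
    repertoire = [sorted({row[i] for row in alignment}) for i in range(length)]
    hits = [0.0] * n
    for _ in range(trials_per_seq * n):
        # Build one random sequence from the column repertoires.
        probe = [rng.choice(symbols) for symbols in repertoire]
        # Distance = number of mismatching positions (assumed metric).
        dists = [sum(a != b for a, b in zip(row, probe)) for row in alignment]
        best = min(dists)
        nearest = [k for k, d in enumerate(dists) if d == best]
        # Sequences that are equally close share the hit.
        for k in nearest:
            hits[k] += 1.0 / len(nearest)
    total = sum(hits)
    return [h / total for h in hits]
```

With an input such as `["AAAA", "AAAA", "CCCC"]`, the two identical sequences split their hits, so the single divergent sequence receives the largest weight. This is the behaviour one wants when a trusted set mixes unique sequences with several close relatives.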
(1) Lüthy R., Xenarios I., Bucher P. Prot. Sci. 3:139-146 (1994).
(2) Thompson J.D., Higgins D.G., Gibson T.J. CABIOS 10:19-29 (1994).
(3) Sibbald P., Argos P. J. Mol. Biol. 216:813-818 (1990).
The generalised profile syntax for PROSITE entries has been described before[1];
a comprehensive description is part of the PROSITE documentation and is also
available electronically from http://ulrec3.unil.ch/profile/.
We convert weighted multiple alignments into generalised profiles by applying
PFMAKE, a newly developed program taking advantage of the advanced features of
the generalised profiles that are missing in the classical profiles as
introduced by Gribskov[2]. In the default procedure, we use a 10*log10-scaled
version of the BLOSUM45 comparison matrix[3], applying symmetrical gap-opening
and gap-closing penalties of 1.05 each, and a gap-extension penalty of 0.21.
If no special requirements are apparent, we use limited gap-excision to exclude
regions in the alignment that are present in fewer than 50% of the sequences
from the profile. These gap-excisions are compensated for by lowering the
insert penalties at the excision boundaries, according to the number of excised
residues.
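The limited gap-excision step can be sketched as follows. This is a simplified illustration, not the PFMAKE code: the function name `excise_gappy_columns`, the unweighted occupancy count, and the gap characters `-` and `.` are assumptions. The sketch reports the length of each excised run so that a later step could adjust the insert penalties at that boundary. The adjustment formula itself is not given in the text and is therefore not reproduced.

```python
def excise_gappy_columns(alignment, min_occupancy=0.5, gap_chars="-."):
    """Drop alignment columns occupied in fewer than min_occupancy of the rows.

    Returns (trimmed_alignment, excised_runs), where excised_runs is a list
    of (boundary, run_length) pairs; boundary is the index in the trimmed
    alignment at which the excised run used to sit.
    """
    n = len(alignment)
    length = len(alignment[0])
    kept = []          # indices of retained columns
    excised_runs = []  # (position among kept columns, number of columns cut)
    run = 0
    for i in range(length):
        occupied = sum(1 for row in alignment if row[i] not in gap_chars)
        if occupied / n >= min_occupancy:
            if run:
                excised_runs.append((len(kept), run))
                run = 0
            kept.append(i)
        else:
            run += 1
    if run:  # excised run at the very end of the alignment
        excised_runs.append((len(kept), run))
    trimmed = ["".join(row[i] for i in kept) for row in alignment]
    return trimmed, excised_runs
```

For the four-row alignment `["AC--GT", "ACT-GT", "AC--GT", "A---GT"]`, columns 3 and 4 are occupied in one and zero rows respectively, so both are removed and a single run of length 2 is recorded between the second and third retained columns.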
Depending on the purpose of the profile (i.e. if it reflects a complete protein
family or a localised homology domain that is part of larger sequences) we can
force it to favour local or global alignment behaviour, or any intermediate
thereof.
(1) Bucher P., Bairoch A. Proc. ISMB-94, pp. 53-61, AAAI/MIT Press (1994).
(2) Gribskov M., McLachlan A.D., Eisenberg D. Proc. Natl. Acad. Sci. USA 84:4355-4358 (1987).
(3) Henikoff S., Henikoff J.G. Proc. Natl. Acad. Sci. USA 89:10915-10919 (1992).
Like most similarity search techniques, a protein database search with a profile returns a sorted list of potential matches ranked by a quality score. Because there is no statistical theory that allows for direct computation of the probability of obtaining a certain score by chance, one has to rely on empirical methods for significance estimation. Such methods typically attempt to fit the parameters of a mathematical function to the score distribution of chance matches found in real or random sequences. If random sequences are used, it is important that the sequences are generated with a procedure that preserves certain statistical properties of biological sequences known to have an influence on the score distribution such as compositional bias and the actual length distribution.
The specific method we use for significance tests with PROSITE profiles uses a regionally shuffled version of SWISS-PROT preserving the original length distribution and amino acid composition in successive windows of length 20 [1]. Each profile is compared against this random database to produce a list of high-scoring profile matches sorted by score. The score distribution is then analysed by plotting the logarithm of the number of observed matches above a given score against the score itself. Such a plot typically shows an approximately linear relationship between these two variables, which would be expected for an extreme value distribution:
where NDB is the number of residues in the database. The parameters a and b are estimated by linear regression analysis and used to calculate a normalised score:
Note that a and b are characteristic parameters of a profile which need to be re-estimated whenever a profile is modified. The probability of finding a match with a given score in a database of a given size can be computed from the normalised score by subtracting the logarithm of the number of residues found in the database.
This value is referred to as the P-value and is used as a significance estimate for
high-scoring profile matches in the real protein sequence database.
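The empirical fitting step can be sketched in Python as below. The sketch only covers the regression of the logarithm of the number of matches at or above a score against the score itself. The names `fit_log_survival` and `expected_random_matches`, the use of base-10 logarithms, and the intercept/slope parametrisation are assumptions; they are not necessarily the a and b of the text, and the normalised-score formula is deliberately not reproduced.

```python
import math

def fit_log_survival(scores):
    """Least-squares fit of log10(#matches with score >= s) = intercept + slope * s.

    scores: raw scores of the matches found in the shuffled database.
    Returns (intercept, slope); slope is negative for a decaying tail.
    """
    ordered = sorted(scores, reverse=True)
    points = []
    for rank, s in enumerate(ordered, start=1):
        # For tied scores keep only the last entry, whose rank equals the
        # number of matches scoring at least s.
        if rank < len(ordered) and ordered[rank] == s:
            continue
        points.append((s, math.log10(rank)))
    if len(points) < 2:
        raise ValueError("need at least two distinct scores for a fit")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def expected_random_matches(score, intercept, slope):
    """Number of chance matches at or above score predicted by the fit,
    for a database of the same size as the shuffled one."""
    return 10.0 ** (intercept + slope * score)
```

Plotting the fitted line over the observed points is a simple way to carry out the visual inspection of the random score distribution: systematic curvature away from the line indicates that the linear approximation does not hold for that profile.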
There are two caveats with this method. The score distribution obtained from the shuffled database with certain profiles can substantially deviate from an extreme value distribution in a way leading to excessively conservative P-value estimates. We therefore visually inspect the random score distribution for each profile. If we observe any anomalies, we construct a new profile with another method. The other caveat is that this method cannot be applied to sequences included in the multiple alignment from which the profile was generated, nor to closely related sequences. In fact, a high-scoring false member included by accident in a multiple alignment is very likely to be picked up as a significant match in a subsequent profile search using this procedure. It is therefore very important that the initial sequence alignment used in a profile-based study of a new sequence motif does not contain any sequence of uncertain status.
(1) Pearson W.R., Lipman D.J. Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).
As mentioned before, it is very important not to include unrelated proteins
into the sequence set used for the profile construction. Every potential
candidate sequence detected in the database search has to be carefully checked
before it can be used for the iterative profile refinement.
The most important condition a potentially new family member has to meet is the
statistical significance of the profile score, as described in section 5. For
sequences with no available functional or structural information, we usually
require a residual error probability p<0.01. If biological or structural
data suggest a meaningful relationship of the test sequence to the family under
consideration, we use a relaxed stringency criterion. A frequent example of
this situation is the potential occurrence of additional copies of a repeat
domain in a protein that already contains several `accepted' copies of this
domain.
If biological or structural data apparently contradict the relationship
indicated by the profile score, we increase the stringency up to p<0.0005.
In cases where even this condition is met, a re-evaluation of the contradicting
data is indicated.
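The acceptance rules above can be summarised in a small decision function. This is only a sketch: the function name `classify_candidate`, the three evidence labels, and the return values are hypothetical. The thresholds 0.01 and 0.0005 come from the text, whereas the relaxed threshold is not specified there and must be supplied by the curator.

```python
DEFAULT_P = 0.01      # no functional or structural information available
STRINGENT_P = 0.0005  # biological or structural data contradict the match

def classify_candidate(p_value, evidence="none", relaxed_p=None):
    """Classify a database match as 'accept', 'reject' or 'reevaluate'.

    evidence: 'none', 'supporting' or 'contradicting' (hypothetical labels).
    relaxed_p: threshold used when independent data support the match;
               the text gives no value, so the caller has to choose one.
    """
    if evidence == "none":
        return "accept" if p_value < DEFAULT_P else "reject"
    if evidence == "supporting":
        if relaxed_p is None:
            raise ValueError("relaxed_p must be chosen explicitly")
        return "accept" if p_value < relaxed_p else "reject"
    if evidence == "contradicting":
        # A match that is significant even at the stringent level calls
        # for a second look at the contradicting data.
        return "reevaluate" if p_value < STRINGENT_P else "reject"
    raise ValueError("unknown evidence label: %r" % (evidence,))
```

The function does not replace the manual check for biological plausibility; it only makes explicit which significance level applies in each of the three situations.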
The generalised profile syntax has been designed to be a superset of most previously used motif descriptors. This allows import from existing domain and pattern collections. Examples of foreign formats that can be converted into generalised profiles without loss of information are: PROSITE patterns[1], weight matrices[2], flexible patterns[3], `classical' profiles[4], and linear Hidden Markov Models (HMMs)[5,6]. The latter conversion is of particular interest, since HMM-training methods have proved to be very effective in some cases, even when using initial sets of unaligned proteins.
(1) Bairoch A. Nucleic Acids Res. 21:3097-3103 (1993).
(2) Staden R. Nucleic Acids Res. 12:505-519 (1984).
(3) Barton G.J., Sternberg M.J. J. Mol. Biol. 212:389-402 (1990).
(4) Gribskov M., McLachlan A.D., Eisenberg D. Proc. Natl. Acad. Sci. USA 84:4355-4358 (1987).
(5) Krogh A., Brown M., Mian I.S., Sjölander K., Haussler D. J. Mol. Biol. 235:1501-1531 (1994).
(6) Baldi P., Chauvin Y., Hunkapiller T., McClure M. Proc. Natl. Acad. Sci. USA 91:1059-1063 (1994).
The feasibility of a mutual interconversion between generalised profiles and HMMs offers some interesting possibilities. The training methods for linear Hidden Markov Models can, in principle, work with sets of unaligned sequences. However, the use of a suitable initial model significantly improves the quality of the trained output model. Our preliminary experiments using converted generalised profiles as the initial model for HMM-training yielded promising results. After the HMM-training is finished, the resulting output model can be re-imported into the generalised profile format and used with conventional database searching programs.
SH2 domain**, SH3 domain**, PH domain**, C1 domain**, C2 domain**, PID domain**, rasGAP domain*, rhoGAP domain*, rapGAP family*, rabGAP family, arfGAP family, cdc24-type rasGRF domain*, cdc25-type rasGRF domain*, rcc-repeat (ranGRF)*, rhoGDI-family, rabGDI-family
C3H2C3-type RING finger*, rsp5/WW-domain*, forkhead-associated (FHA) domain*, polo-box*, death-box*, lipoxygenase appendage domain*, bromo-domain*, chromo-domain*, IQ-domain*, BTB-domain*, α-latrotoxin receptor interaction domain
forkhead-domain*, myb-domain*, ets-domain*, HMG (high-mobility group) domain*, MCM (mini-chromosome maintenance) domain, KH-domain*
protein-kinase domain**, lipid kinase (PI3K) domain*, PI-specific PLC X-box** and Y-box**, bacterial PLC-domain, bacterial SMase, intracellular (plant-type) PLD, extracellular PLD, intracellular PLA2, HECT-domain (ubiquitin-transferase)*
leucine-rich repeat*, LRR-flanking regions*, TPR repeat**, wd40 repeat**, ankyrin repeat*, spectrin repeat*, gelsolin repeat*, filamin/ABP280 repeat
cub domain**, anaphylatoxin-domain**, saposin II-domain*, C-type lectin domain*, thrombospondin type I domain*, archaebacterial surface layer repeat*
hsp20-family**, globin-family**, cpn10-family*, ricin-family*, IMB/FBP/IPP-family*
Profiles that are already part of PROSITE are labeled with (**); experimental profiles that are not yet part of PROSITE but can be searched using our WWW-server (http://ulrec3.unil.ch) are labeled with (*).