Detection and Analysis of Distantly Related C2-like Membrane-Attachment Domains

Kay Hofmann and Philipp Bucher

ISREC Bioinformatics Group
Swiss Institute of Experimental Cancer Research
CH-1066 Epalinges s/Lausanne

The C2-domain of the classical protein kinase C isoforms, which is also present in synaptotagmin, has been shown to be important for Ca2+-dependent membrane attachment of these proteins. Subsequently, this domain of about 120 residues was also found in rabphilin, unc-13, and cytosolic phospholipase A2. Using an improved profile-based search method, we were able to detect previously unknown C2-like domains in a great variety of otherwise unrelated proteins.
Among these proteins are several non-classical PKC isoforms, all isoforms of PI-specific phospholipase C, the plant phospholipase D, phosphatidylinositol-3-kinase, several GTPase-activating proteins, and, as an exceptional case, the extracellular protein perforin. A common denominator of all these proteins is their requirement for membrane attachment, although this is not regulated by Ca2+ in all cases. Most of these proteins are involved in the major intracellular signal transduction pathways.
A detailed analysis of the C2-domain sequences, in combination with the known 3D-structure of one C2-domain, allows the determination of residues important for the various functional aspects of this domain. This knowledge can in turn be applied to predict the role of the C2-domain in several uncharacterized protein sequences that have become available through the genome projects.


The generalized sequence profile method

Sequence profiles are a very sensitive tool for the discovery of distant sequence relationships. In contrast to conventional sequence comparison and database search methods, the query is not a single sequence but a profile constructed from a family of related sequences. These profiles are normally derived from multiple alignments of the initial sequence set. In addition to the sequences themselves, a profile contains the following information:

For the detection of distant relationships, the assessment of statistical significance is very important. We estimate the error probability from an empirical score distribution obtained by searching a randomized version of SwissProt (shuffling window 20). As can be seen from the figure below, in most cases the score distribution can be approximated by an extreme value distribution.
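
The randomization step and the empirical error estimate described above can be sketched as follows. This is a minimal Python illustration, not the code used for the searches: it assumes that "shuffling window 20" means residues are permuted within consecutive, non-overlapping windows of 20 residues, and the function names are hypothetical.

```python
import random

def shuffle_in_windows(seq, window=20, rng=None):
    """Permute residues within consecutive, non-overlapping windows.

    Assumption: this is one plausible reading of "shuffling window 20".
    Local composition is preserved; residue order inside a window is not.
    """
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        pieces.append("".join(chunk))
    return "".join(pieces)

def empirical_error_probability(score, random_scores):
    """Fraction of randomized-database scores that reach the given score."""
    hits = sum(1 for s in random_scores if s >= score)
    return hits / len(random_scores)
```

In such a scheme, every database sequence would be shuffled once, the profile would be scored against the shuffled set, and the resulting list of scores would serve as `random_scores` when judging a real match.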

An additional advantage of the profile method is the possibility of iterative profile refinement. Sequences with highly significant similarity to the profile are aligned to the profile and included in the profile construction process for the next round of database searches.
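
The refinement loop can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the toy profile is a table of per-column residue frequencies built from ungapped, equal-length sequences, the score is a plain sum of those frequencies, and a fixed score threshold stands in for the significance cutoff. Generalized profiles use position-specific weights and gap penalties, which this sketch does not attempt to reproduce.

```python
from collections import Counter

def build_profile(seqs):
    """Toy profile: per-column residue frequencies of equal-length sequences."""
    n = len(seqs)
    return [{res: c / n for res, c in Counter(col).items()} for col in zip(*seqs)]

def score(profile, seq):
    """Toy score: sum of the profile frequencies of the residues in seq."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq))

def refine_profile(seed, database, threshold, max_rounds=5):
    """Rebuild the profile until no new sequence scores above the threshold."""
    members = list(seed)
    profile = build_profile(members)
    for _ in range(max_rounds):
        new_hits = [s for s in database
                    if s not in members and score(profile, s) >= threshold]
        if not new_hits:
            break  # converged: this round found nothing new
        members.extend(new_hits)
        profile = build_profile(members)  # include the hits in the next round
    return profile, members

profile, members = refine_profile(
    seed=["ACDE", "ACDF"],
    database=["ACEF", "AGEF", "WWWW"],
    threshold=1.9,
)
```

In this toy run, "AGEF" scores below the threshold against the seed profile and is only picked up in the second round, after "ACEF" has been added to the profile. This is the effect that makes iteration useful for finding distant family members.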

Currently, we are constructing generalized profiles for the most important of the highly divergent protein domains and families. These profiles will be included in the PROSITE pattern collection. We are also trying to detect and describe previously unknown homology domains, with emphasis on proteins involved in signal transduction and regulatory pathways.

Selected references

Analysis of C2 domains:

Brose, N., Hofmann, K., Hata, Y. and Sudhof, T.C. (1995)
J. Biol. Chem. 270:25273-25280

The original profile method:

Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987)
Proc. Natl. Acad. Sci. USA 84:4355-4358

Improvements to the profile method:

Luthy, R., Xenarios, I. and Bucher, P. (1994)
Prot. Sci. 3:139-146

Thompson, J.D., Higgins, D.G. and Gibson, T. (1994)
Comput. Applic. Biosci. 10:19-29

The generalized sequence profiles:

Bucher, P. and Bairoch, A. (1994)
in: Proceedings of the 2nd ISMB Conference, pp. 53-61, AAAI press.

Bucher, P., Karplus, K., Moeri, N. and Hofmann, K. (1996)
Computers and Chemistry 20:3-24

The PROSITE pattern library:

Bairoch, A., Bucher, P. and Hofmann, K. (1996)
Nucleic Acids Res. 24:189-196


The score distribution of the C2-domain profile

The upper part of the figure shows the frequency distribution obtained by a run of the C2-profile against a randomized version of SwissProt.

[PostScript]

The lower part of the figure shows the base-10 logarithm of the cumulative frequency distribution of the C2-domain profile against the randomized database. The cumulative frequency value corresponding to a score of 300, for example, is the number of sequences giving scores better than 300.
For most profiles, the graph can be approximated by a linear relationship in the high score range.

[PostScript]
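
The linear relationship in the high score range suggests a simple way to extrapolate beyond the scores actually observed in the randomized database. The following Python sketch is an illustration under stated assumptions, not the procedure used to produce the figure: it counts sequences scoring at or above each score (so that the top score has a non-zero count), fits a least-squares line to the log10 counts, and reads off the expected number of random matches at a given score. All function names are hypothetical.

```python
import math

def log10_cumulative(scores):
    """Return sorted (score, log10(number of scores >= score)) pairs."""
    ordered = sorted(scores, reverse=True)
    count_at = {}
    for rank, s in enumerate(ordered, start=1):
        count_at[s] = rank  # last rank among ties = number of scores >= s
    return sorted((s, math.log10(n)) for s, n in count_at.items())

def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def expected_random_hits(score, slope, intercept):
    """Extrapolated number of random sequences reaching the given score."""
    return 10 ** (slope * score + intercept)
```

In practice the fit would be restricted to the points of the high-scoring tail, for example the upper few percent of `log10_cumulative(random_scores)`; where that tail begins is a judgment call that depends on the profile.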

Figures

Alignment of some typical C2-like domains

[PostScript] [GIF]

Phylogenetic tree of some typical C2-like domains

[PostScript] [GIF]

The domain structure of the proteins containing C2-like domains

[PostScript] [GIF] [PostScript] [GIF]

Left: The Protein Kinase C family.
Right: The PI-specific Phospholipase C family.


[PostScript] [GIF] [PostScript] [GIF]

Left: Synaptotagmins and other proteins with multiple C2-domains.
Right: Miscellaneous proteins with C2-like domains.


More signal transduction and other domains

We have created profiles for the following domains and protein families. Some of them are already part of PROSITE, the others will appear there soon. A set of programs for the comparison of sequences with collections of profiles will be available at the time of the next PROSITE release.

domains in signal transduction proteins

SH2 domain**, SH3 domain**, PH domain**, C1 domain**, C2 domain**, PID domain**, rasGAP domain*, rhoGAP domain*, rapGAP family*, rabGAP family, arfGAP family, cdc24-type rasGRF domain*, cdc25-type rasGRF domain*, rcc-repeat (ranGRF)*, rhoGDI-family, rabGDI-family

putative intracellular protein/protein interaction domains

C3H2C3-type RING finger*, rsp5/WW-domain*, forkhead-associated (FHA) domain*, polo-box*, death-box*, lipoxygenase appendage domain*, bromo-domain*, chromo-domain*, IQ-domain*, BTB-domain*, α-latrotoxin receptor interaction domain

some DNA/RNA-binding domains

forkhead-domain*, myb-domain*, ets-domain*, HMG (high-mobility group) domain*, MCM (mini-chromosome maintenance) domain, KH-domain*

some catalytic domains

protein-kinase domain**, lipid kinase (PI3K) domain*, PI-specific PLC X-box** and Y-box**, bacterial PLC-domain, bacterial SMase, intracellular (plant-type) PLD, extracellular PLD, intracellular PLA2, HECT-domain (ubiquitin-transferase)*

some repeat domains

leucine-rich repeat*, LRR-flanking regions*, TPR repeat**, wd40 repeat**, ankyrin repeat*, spectrin repeat*, gelsolin repeat*, filamin/ABP280 repeat

a few extracellular domains

cub domain**, anaphylatoxin-domain**, saposin II-domain*, C-type lectin domain*, thrombospondin type I domain*, archaebacterial surface layer repeat*

miscellaneous protein families

hsp20-family**, globin-family**, cpn10-family*, ricin-family*, IMB/FBP/IPP-family*

Profiles that are already part of PROSITE are labeled with (**); experimental profiles that are not yet part of PROSITE but can be searched using our WWW-server (http://ulrec3.unil.ch) are labeled with (*).