The current release of TMbase (TMbase25) is based on SwissProt release 25. It has been briefly described in the meeting abstract:
K. Hofmann, W. Stoffel: TMBASE - A database of membrane spanning protein segments. Biol. Chem. Hoppe-Seyler 374, 166 (1993)

The data stored in TMbase contain, among other things, information on the following:
Most of the SwissProt-based information has been extracted from the SwissProt
annotation in a semi-automatic manner. Any errors in these annotations will
therefore probably be propagated into TMbase, too.
Some consistency checks have been performed, however, and wherever an
inconsistency was found, the contradictory information has not been included
in TMbase.
It should be noted that errors in the annotation of transmembrane helices
are quite frequent, since the annotation only reflects the uncertainty of
TM-domain prediction and the lack of experimental data.
The methodology for grouping the transmembrane proteins into families and for calculating the 'relative degeneracy' is explained in the paragraphs below.
Besides the individual TMbase tables in the *.txt files, the whole TMbase collection is present as a compressed tar-file (tmbase25.tar.Z) for use on unix-machines or as a compressed zip-file (tmbase25.zip) for use on PCs.
counter     : integer  : just an index
ID          : string10 : the SwissProt ID field
Description : string80 : the SwissProt DE field (possibly truncated)
Length      : integer  : the length of the protein
No of TM    : integer  : the number of annotated TM-helices

helix-ID    : string15 : unique identifier for a helix
ID          : string10 : the associated ID of the tmb_e entry
TM#         : integer  : position number of the helix in the protein
Start       : integer  : position of first residue in the helix
Stop        : integer  : position of last residue in the helix
Pre         : string5  : the five residues N-terminal of the helix
Transmem    : string35 : the amino acid sequence of the helix
Post        : string5  : the five residues C-terminal of the helix
Comment     : string41 : any comments (e.g. 'signal')
Orientation : string2  : io=inside to outside, oi=the opposite
                         'inside' normally means the cytoplasmic face,
                         'outside' the lumenal face of the membrane,
                         depending on the organelle (see tmb_mem)
ID    : string10 : the SwissProt ID field
Nterm : string1  : either 'i' for inside or 'o' for outside
                   position of the N-terminus of the protein.

helix-ID : string15 : the helix identifier used in tmb_h
Nterm    : string1  : either 'i' for inside or 'o' for outside
                      position of the N-terminus of the helix
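The relation between the 'Nterm' codes and the 'Orientation' codes can be sketched in a few lines of Python. This is only an illustration and not part of TMbase: the function names are hypothetical, and helix_nterm() rests on the assumption that every annotated TM-helix crosses the membrane exactly once, so that the sides alternate along the chain. TMbase stores the helix values explicitly, and they should be preferred over anything derived this way.

```python
def flip(side: str) -> str:
    """Return the opposite membrane side ('i' <-> 'o')."""
    return {"i": "o", "o": "i"}[side]


def orientation(helix_nterm_side: str) -> str:
    """Build the two-letter code: 'io' = inside to outside, 'oi' = the opposite."""
    return helix_nterm_side + flip(helix_nterm_side)


def helix_nterm(protein_nterm: str, tm_number: int) -> str:
    """Side of the N-terminus of helix number tm_number (1-based).

    ASSUMPTION: strict alternation, i.e. each helix spans the membrane once.
    """
    if tm_number < 1:
        raise ValueError("TM# starts at 1")
    return protein_nterm if tm_number % 2 == 1 else flip(protein_nterm)


# A protein with a cytoplasmic ('i') N-terminus and three TM-helices.
codes = [orientation(helix_nterm("i", k)) for k in (1, 2, 3)]
print(codes)
```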
Organelle-type: pro = prokaryotic cell membrane
                er  = eukaryotic ER-, Golgi- or Plasma-membrane
                mit = mitochondrial membrane
                chl = chloroplast membrane
                vir = misc. viral membrane proteins
Subtype       : er  = ER membrane
                gol = Golgi membrane
                mic = unspecified microsomal proteins
                pla = euk. plasma membrane
                lys = lysosomal membrane
                inn = inner membrane (e.g. mitochondrial)
                out = outer membrane
                thy = thylakoid membrane
                env = envelope membrane (viral)
The records have the following structure:
ID             : string10 : the SwissProt ID field
Organelle-type : string3  : the organelle the protein is associated to
Subtype        : string3  : the subtype of the membrane (see above)

ID             : string10 : the SwissProt ID field
PAM80-Family   : integer  : the number of the family the protein belongs to
PAM200-Family  : integer  : ditto
PAM200-reduced : integer  : ditto

PAM200r-Family : integer  : the family number (see tmb_fam)
population     : integer  : the number of members in the family
function       : string80 : short description of the family
No of TM       : string10 : No. of TM-helices of a typical member

ID                    : string10 : the SwissProt ID field
all                   : real     : Degeneracy measures, type II
known orientation     : real     :
ER, known orientation : real     :
all/PAM80             : integer  : Degeneracy measures, type I
all/PAM200            : integer  :
ER/PAM200             : integer  :

ID         : string10 : the SwissProt ID field
Protein-ID : string4  : The part of the ID representing the protein
Species-ID : string5  : The part of the ID representing the organism
Group-ID   : string5  : taxonomy ID in the SwissProt SPECLIST.TXT
Spec-Type  : string1  : one letter code for basic taxonomy
Spec-No    : integer  : taxonomy number in the SwissProt SPECLIST.TXT
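As an illustration of the Protein-ID and Species-ID fields, the following Python sketch splits a SwissProt ID at the underscore. The function name is hypothetical, and the sketch assumes the usual SwissProt entry-name layout of a protein mnemonic, an underscore, and a species code; it does not read any TMbase file.

```python
from typing import Tuple


def split_swissprot_id(entry_id: str) -> Tuple[str, str]:
    """Split a SwissProt ID into (Protein-ID, Species-ID).

    ASSUMPTION: the ID has the form PROTEIN_SPECIES with a single underscore.
    """
    protein, sep, species = entry_id.strip().partition("_")
    if not sep or not protein or not species:
        raise ValueError("not a PROTEIN_SPECIES style ID: %r" % entry_id)
    return protein, species


# OPSD_BOVIN (bovine rhodopsin) is used here purely as an example ID.
protein_id, species_id = split_swissprot_id("OPSD_BOVIN")
print(protein_id, species_id)
```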
Group-ID  : string5  : taxonomy ID in the SwissProt SPECLIST.TXT
Spec-Type : string1  : one letter code for basic taxonomy
Spec-No   : integer  : taxonomy number in the SwissProt SPECLIST.TXT
1st level : string27 : the levels of taxonomy
2nd level : string33 : .
3rd level : string41 : .
4th level : string37 : .
5th level : string22 : .
6th level : string16 : .
7th level : string19 : .

ID            : string10 : the SwissProt ID field
No of chains  : integer  : the number of annotated chains
No of signals : integer  : the number of annotated N-terminal signals.
The similarity method that is used in the files tmb_fam and tmb_deg is based
on the following procedure:
The 'allalldb' function of the DARWIN server at the ETH Zurich was used to
get similarity data for each possible pair of transmembrane proteins that
exceeds a certain similarity threshold. These similarity data include a
Smith-Waterman score and an estimated PAM value. (PAM is the number of
accepted point mutations per 100 residues.)
In the next step, the proteins were grouped into families of sequences that
have a PAM distance of less than 80. Each protein that has a PAM value of
less than 80 to any member of an existing family was assumed to belong to
that family, too. This method does not imply that each pair of sequences
in a particular family must necessarily have a PAM distance of less than 80.
The families resulting from this procedure are called 'PAM80-families'.
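The grouping rule described above amounts to single-linkage clustering: two proteins end up in the same family whenever a chain of pairwise PAM distances below the threshold connects them. The Python sketch below illustrates this with a small union-find structure. It is not the original implementation; the function name, the input layout (a dict of protein pairs to PAM estimates) and the example data are assumptions made for illustration.

```python
from collections import defaultdict


def single_linkage_families(ids, pam_pairs, threshold=80.0):
    """Group proteins into families by single linkage on PAM distance.

    ids       : iterable of protein IDs
    pam_pairs : dict mapping (id_a, id_b) to an estimated PAM distance;
                pairs without similarity data are simply absent
    """
    parent = {p: p for p in ids}

    def find(x):
        # Follow parent links to the family representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), pam in pam_pairs.items():
        if pam < threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two families

    groups = defaultdict(list)
    for p in parent:
        groups[find(p)].append(p)
    return sorted(sorted(members) for members in groups.values())


# Made-up example data: A-B and B-C are close, A-C and C-D are distant.
example_ids = ["A", "B", "C", "D"]
example_pairs = {("A", "B"): 50, ("B", "C"): 70, ("A", "C"): 120, ("C", "D"): 150}

pam80 = single_linkage_families(example_ids, example_pairs, threshold=80)
pam200 = single_linkage_families(example_ids, example_pairs, threshold=200)
print(pam80)
print(pam200)
```

In this example A and C share a PAM80 family through B although their own PAM distance is 120, and raising the threshold to 200 merges everything into a single, larger family.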
In an entirely analogous fashion, PAM200 families were computed. These PAM200
families include even more distantly related sequences and, as a consequence,
there are fewer of these families, each containing more sequences.
Based on these family groupings, the degeneracy values 'all/PAM80', 'all/PAM200', and 'ER/PAM200' have been calculated. The 'all/PAM80' field for a given protein holds the number of proteins belonging to the same PAM80 family as that protein. The reciprocal of this value can be used as a weighting factor in statistical analyses. The 'all/PAM200' field holds the analogous value for PAM200 families. The 'ER/PAM200' field, which is present only for proteins of the organelle type 'er' (see tmb_mem), contains the number of 'ER' proteins belonging to the same PAM200 family as the given protein.
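A minimal Python sketch of these family-size counts and of the reciprocal weighting is given here. The function names and the example families are hypothetical, and the sketch assumes that the count includes the protein itself, so that a protein without relatives gets the value 1 and the weight 1.0.

```python
def type1_degeneracy(families):
    """Map each protein ID to the size of its family.

    ASSUMPTION: the protein itself is counted as a family member.
    """
    return {protein: len(family) for family in families for protein in family}


def reciprocal_weights(degeneracy):
    """Turn family sizes into weights; each family then has a total weight of 1."""
    return {protein: 1.0 / size for protein, size in degeneracy.items()}


# Made-up families: one with three members, one singleton.
example_families = [["A", "B", "C"], ["D"]]
deg = type1_degeneracy(example_families)
weights = reciprocal_weights(deg)
print(deg)
print(weights)
```

With such weights, a large family contributes as much to a statistic as a single unrelated protein, which is the purpose of the weighting.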
The measure above captures only family membership: every other member of the same family contributes a value of 1, no matter whether it is a close relative or a more distantly related sequence. To avoid this restriction, the 'type II' degeneracy measure has been introduced. The 'type II' degeneracy of a given protein i, Deg(i), is given by the following sum:
                       1
    Deg(i) = Sum --------------
              j  1+PAMdist(i,j)

where j runs over the proteins belonging to the same family.
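The sum can be written out in Python as follows. This is a sketch under stated assumptions, not the code used to build tmb_deg: the function name and the example distances are made up, the text does not say whether j includes the protein i itself (exposed here as the include_self switch, with PAMdist(i,i) taken as 0), and family members without a PAM estimate for the pair are skipped.

```python
def type2_degeneracy(protein, family, pam_dist, include_self=True):
    """Compute Deg(i) = Sum_j 1 / (1 + PAMdist(i, j)) over one family.

    pam_dist : dict mapping frozenset({id_a, id_b}) to a PAM distance
    ASSUMPTIONS: include_self adds the term for j = i (distance 0, value 1);
                 pairs without a PAM estimate contribute nothing.
    """
    total = 1.0 if include_self else 0.0
    for other in family:
        if other == protein:
            continue
        distance = pam_dist.get(frozenset((protein, other)))
        if distance is None:
            continue  # no similarity data for this pair
        total += 1.0 / (1.0 + distance)
    return total


# Made-up example: A has one close and one distant relative.
example_family = ["A", "B", "C"]
example_dist = {frozenset(("A", "B")): 49.0, frozenset(("A", "C")): 99.0}

deg_a = type2_degeneracy("A", example_family, example_dist)
print(round(deg_a, 4))
```

In contrast to the family-size count, a close relative (PAM 49) adds twice as much to the sum as a more distant one (PAM 99).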
This too is clearly not a perfect measure for protein degeneracy.
Nevertheless, it has proved to be useful in avoiding large-family bias in
several statistical analyses.

Kay Hofmann
Bioinformatics group
ISREC
Chemin des Boveresses 155
CH-1066 Epalinges s/Lausanne
Switzerland
-------------------------------
FAX: +41 (21) 652-6933
Email: khofmann@isrec-sun1.unil.ch