USER MANUAL
Philipp Bucher and Vidhya Jagannathan,
Swiss Institute of Bioinformatics
and Swiss Institute for Experimental Cancer Research
Ch. des Boveresses 155
CH-1066 Epalinges s/Lausanne
Switzerland
Electronic mail via:
http://www.htpselex.isb-sib.ch/htpselex/webmail.html
Home page: http://www.isrec.isb-sib.ch/htpselex
This manual and the database it accompanies may be copied and redistributed freely, without advance permission, provided that this statement is reproduced with each copy.
Published research assisted by the HTPSELEX database should cite:
Jagannathan, V., Roulet, E. and Bucher, P. (2006). HTPSELEX: A database for high-throughput SELEX data. Nucleic Acids Res. database issue, submitted.
1. INTRODUCTION
2. DATA DESCRIPTION
3. FORMAT CONVENTIONS
3.1 Entry types and identifiers
3.2. Experiments (file htpselex.doc)
3.2.1 The ID line
3.2.2 The DE line
3.2.3 The FN line
3.2.4 The FC line
3.2.5 The FS line
3.2.6 The NS line
3.2.7 The SQ line blocks
3.2.8 The SX lines
3.2.9 The XR lines
3.3. Sequence data
3.3.1 Clone insert sequences
3.3.1.1. The ID line
3.3.1.2. Database cross-references
3.3.1.3. Features Header
3.3.1.3. Features tables
3.3.2. Tag sequences
1. INTRODUCTION
The HTPSELEX database contains sets of in vitro selected transcription factor binding site sequences obtained with the high-throughput SELEX (HTPSELEX) method described in Roulet et al. (Nat Biotechnol, 2002, 20:831).
In addition, the database contains binding sites obtained with the conventional SELEX method (Tuerk and Gold, Science, 1990, 249:505-510).
A typical SELEX experiment has the following steps:
2. DATA DESCRIPTION
A complete SELEX experiment starts with a purified nucleic acid binding protein and terminates with a computational model of its binding specificity.
Each entry in the database corresponds to one HTPSELEX or conventional SELEX experiment.
For each HTPSELEX and SELEX experiment the following details are recorded (if available for SELEX entries):
3. FORMAT CONVENTIONS
HTPSELEX entries are presented in a format similar to that of EMBL and SWISS-PROT sequence entries.
3.1 Entry types and identifiers
The HTPSELEX database is distributed as three main flat files from our FTP server, each containing a collection of a particular entry type:
The trace files and binding models are available as compressed archives.
HTPSELEX entries have composite identifiers reflecting the hierarchical relationships between them. The components are alphanumeric strings separated by underscore characters. An experiment entry is identified by a short alphanumeric string, e.g. "NF1" for the CTF/NF1 experiment.
The clone sequence entries contain either a complete insert sequence or a partial sequence from the left or right. The latter occurs when the complete sequence of the insert could not be assembled from the sequencing reads. The clone sequence identifiers consist of the experiment ID, the cycle number, the clone number and optionally the sequencing direction (e.g. NF1_3_00001, NF1_3_0500_F).
The tag identifier consists of the experiment ID, the cycle number, the clone number and the tag serial number (e.g. NF1_3_00001_1).
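The composition of these identifiers can be illustrated with a short Python sketch. It is only an illustration, not part of the HTPSELEX distribution: the names SelexId and parse_identifier are hypothetical, and the code assumes that the experiment ID itself never contains an underscore.

```python
from typing import NamedTuple, Optional


class SelexId(NamedTuple):
    experiment: str
    cycle: Optional[int]
    clone: Optional[str]   # kept as text so that leading zeros survive
    suffix: Optional[str]  # tag serial number or sequencing direction


def parse_identifier(identifier: str) -> SelexId:
    """Split a composite identifier on underscores (assumed layout)."""
    parts = identifier.split("_")
    experiment = parts[0]
    cycle = int(parts[1]) if len(parts) > 1 else None
    clone = parts[2] if len(parts) > 2 else None
    suffix = parts[3] if len(parts) > 3 else None
    return SelexId(experiment, cycle, clone, suffix)


print(parse_identifier("NF1_3_00001_1"))
print(parse_identifier("NF1_3_0500_F"))
```

The fourth component is returned as plain text because the same position holds either a tag serial number or a sequencing direction, depending on the entry type.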
3.2. Experiments (file htpselex.doc)
Each line of an experiment entry begins with a two-character line code indicating the type of information contained in the line. The entry description is based on 28 fields. The current line types and line codes, and the order in which they appear in an entry, are shown below:
ID - identification
EN - entry name
DT - date of creation
DE - description
FN - factor name
FC - factor complex
FS - factor source
RN - reference number
RX - reference hyperlink
RA - reference authors
RT - reference title
RL - reference link
EX - nature of DNA-protein binding experiment
NS - sequence notation for input library, vector clip left, vector clip right, tag unit
SX - SELEX library descriptor
XR - database cross-references
// - termination line
Spacer lines (XX) are inserted in order to make the database easier to read by eye. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). Text does not exceed column 72.
Below is an example of an entry:
ID NF1; HTS; version 1.
XX
EN CTF/NF1
XX
DT 09-Aug-2005
XX
DE HTP SELEX for transcription factor CTF/NF1, 4 cycles
XX
FN transcription factor CTF/NF1
FC A2
FS recombinant protein; vaccinia system
XX
RN [1]
RX PUBMED; 12101405
RA Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P.
RT High-throughput SELEX SAGE method for quantitative modeling of
RT transcription-factor binding sites.
RL Nat Biotechnol. 2002 Aug;20(8):831-5.
XX
EX HTPSELEX
XX
NS input; 78 bp
SQ tccatctctt ctgtatgtcg agatctannn nnnnnnnnnn nnnnnnnnnn nntagatctc
SQ ctaaccgact ccgttaatt
NS vector left; 62 bp;
SQ ggccgccagt gtgatggata tctgcagaat tccagcacac tggcggccgt tactagtgga
NS tag unit; 33 bp;
SQ tctannnnnn nnnnnnnnnn nnnnnnnnnt aga
NS vector right; 62 bp;
SQ tccgagctcg gtaccaagct tgatgcatag cttgagtatt ctatagtgtc acctaaatag
XX
SX Cycle 0; R25_0; 467 traces; 425 clones; 854 hq-tags
SX Cycle 1; NF1_1; 479 traces; 402 clones; 955 hq-tags
SX Cycle 2; NF1_2; 467 traces; 367 clones; 1203 hq-tags
SX Cycle 3; NF1_3; 1924 traces; 1425 clones; 5579 hq-tags
SX Cycle 4; NF1_4; 315 traces; 253 clones; 309 hq-tags
XX
XR gene A; HGNC; 7784; NF1A.
XR protein A; Uniprot/Swissprot; Q12857[1 ..399]; NF1A_HUMAN
XR input; HTPSELEX:R25
XR restriction endonuclease; REBASE:261; BglII (5' A|GATCT 3' TCTAG|A).
XR sequencing vector; pZERO-2T;EMBL:Y10545; ECY10545
XX
//
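As an illustration of the line-code layout shown in the example entry, the following Python sketch groups the lines of each entry by their two-character code. It is a minimal sketch under stated assumptions, not an official parser: the function name read_entries and the SAMPLE text are hypothetical, and the code assumes that the value starts after the two-character code and that XX lines carry no information.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def read_entries(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping each line code to its list of values."""
    entry: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):  # terminator line closes the entry
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        code, value = line[:2], line[2:].strip()
        if not code.strip() or code == "XX":  # skip blank and spacer lines
            continue
        entry[code].append(value)
    if entry:  # tolerate a final entry without a terminator
        yield dict(entry)


SAMPLE = """\
ID NF1; HTS; version 1.
XX
EN CTF/NF1
XX
RT High-throughput SELEX SAGE method for quantitative modeling of
RT transcription-factor binding sites.
XX
//
"""

for entry in read_entries(SAMPLE.splitlines()):
    print(entry["ID"][0], "|", " ".join(entry["RT"]))
```

Lines that share a code are kept in their original order, so multi-line fields such as RT can be joined afterwards. The pairing between an NS line and the SQ lines that follow it is not preserved by this grouping and would need separate handling.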
3.2.1 The ID line
The identification line is always the first line of an entry. The general form of the ID line is:
ID exp_type; version no.
The ID line is terminated by a period.
3.2.2 The DE line
DE HTP SELEX for transcription factor CTF/NF1, 4 cycles
The description line is free format and gives general information about the entry.
3.2.3 The FN line
FN transcription factor CTF/NF1
The line describes the transcription factor used for the SELEX experiment.
3.2.4 The FC line
FC A2
The FC line gives the factor complex, which describes the form in which the protein binds to the DNA.
For example, if the protein binds to the DNA as a dimer, the factor complex is given as A2; if it binds as a monomer, it is given as A;
if it binds as a heterodimer, it is given as AB.
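The notation can be read like a simple chemical formula. The Python sketch below turns an FC value into subunit counts; the function name parse_factor_complex is hypothetical, and the assumption that every subunit is a single upper-case letter optionally followed by a count goes beyond the three cases (A, A2, AB) described in this manual.

```python
import re
from typing import Dict


def parse_factor_complex(fc: str) -> Dict[str, int]:
    """Count subunits in a factor complex string such as 'A', 'A2' or 'AB'."""
    subunits: Dict[str, int] = {}
    # Assumed grammar: one upper-case letter per subunit, optional count after it.
    for letter, count in re.findall(r"([A-Z])(\d*)", fc.strip()):
        subunits[letter] = subunits.get(letter, 0) + (int(count) if count else 1)
    return subunits


for value in ("A", "A2", "AB"):
    print(value, parse_factor_complex(value))
```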
3.2.5 The FS line
FS recombinant protein; vaccinia system
The FS line describes the source of the factor. The format of the line is as follows:
FS protein_type; production system
3.2.6 The NS line
Multiple NS lines describe the sequence features of the SELEX cycle.
The general features given in this block are the input library, the left and right vector clips, and the tag unit.
3.2.7 The SQ line blocks
The SQ line blocks contain the sequence of the feature described in the preceding NS line:
NS input; 78 bp
SQ tccatctctt ctgtatgtcg agatctannn nnnnnnnnnn nnnnnnnnnn nntagatctc
The sequences are given in EMBL format, with 60 nucleotides per line in blocks of 10 nucleotides.
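The relationship between an NS line and the SQ lines that follow it can be sketched in Python as below. This is an illustration only: the function name read_sequence_blocks is hypothetical, and the code assumes that the second field of the NS line is always a length followed by "bp". The tag unit block is taken from the example entry in section 3.2.

```python
from typing import Iterable, List, Tuple


def read_sequence_blocks(lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (feature, declared_length, sequence) for each NS/SQ block."""
    blocks: List[list] = []
    for line in lines:
        code, value = line[:2], line[2:].strip()
        if code == "NS":
            # e.g. "tag unit; 33 bp;" -> feature "tag unit", declared length 33
            fields = [f.strip() for f in value.split(";") if f.strip()]
            blocks.append([fields[0], int(fields[1].split()[0]), ""])
        elif code == "SQ" and blocks:
            # drop the spaces between the 10-nucleotide groups
            blocks[-1][2] += value.replace(" ", "")
    return [(name, declared, seq) for name, declared, seq in blocks]


TAG_UNIT = [
    "NS tag unit; 33 bp;",
    "SQ tctannnnnn nnnnnnnnnn nnnnnnnnnt aga",
]

for name, declared, seq in read_sequence_blocks(TAG_UNIT):
    print(name, declared, len(seq), seq.count("n"))
```

Returning both the declared length and the concatenated sequence lets a reader compare the two values for each block.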
3.2.8 The SX lines
The SX lines give a complete description of each cycle of the HTP SELEX procedure.
Each line gives the number of traces, clones and high-quality tags (hq-tags) that are available for the cycle, e.g.
SX Cycle 1; NF1_1; 479 traces; 402 clones; 955 hq-tags
The line also hyperlinks to the actual data.
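A possible way to extract the counts from an SX line is sketched below in Python. The names CycleSummary, SX_PATTERN and parse_sx are hypothetical, and the regular expression assumes that every SX line follows exactly the field order of the example above.

```python
import re
from typing import NamedTuple


class CycleSummary(NamedTuple):
    cycle: int
    library_id: str
    traces: int
    clones: int
    hq_tags: int


# Assumed layout: SX Cycle <n>; <library id>; <n> traces; <n> clones; <n> hq-tags
SX_PATTERN = re.compile(
    r"^SX\s+Cycle\s+(\d+);\s*([^;\s]+);\s*(\d+)\s+traces;"
    r"\s*(\d+)\s+clones;\s*(\d+)\s+hq-tags\s*$"
)


def parse_sx(line: str) -> CycleSummary:
    """Parse one SX line into its cycle number, library ID and counts."""
    match = SX_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised SX line: {line!r}")
    cycle, library_id, traces, clones, hq_tags = match.groups()
    return CycleSummary(int(cycle), library_id, int(traces), int(clones), int(hq_tags))


print(parse_sx("SX Cycle 1; NF1_1; 479 traces; 402 clones; 955 hq-tags"))
```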
3.2.9 The XR lines
The XR lines are cross-references to various other databases. We have incorporated links to UniProt/Swiss-Prot, EMBL, HGNC and REBASE. The format of the lines depends on the target database.
XR gene A; HGNC; 7784; NF1A.
XR protein A; Uniprot/Swissprot; Q12857[1 ..399]; NF1A_HUMAN
XR input; HTPSELEX:R25
XR restriction endonuclease; REBASE:261; BglII (5' A|GATCT 3' TCTAG|A).
XR sequencing vector; pZERO-2T;EMBL:Y10545; ECY10545
3.3. Sequence data
The sequence data are represented in an EMBL-like format. Each sequence entry starts with an identifier line ("ID") followed by further annotation lines.
3.3.1 Clone insert sequences
The clone insert sequences are obtained after Phred/Phrap analysis of the trace files.
The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked with two slashes ("//").
3.3.1.1. The ID line
The ID line has the format:
ID cloneID standard; DNA; UNC; sequencelength BP.
cloneID: stable identifier, consisting of alphanumeric characters, describing the transcription factor used in the HTP SELEX experiment. All letters are in upper case.
standard: marks entries which are complete to the standards described in this manual.
UNC: unclassified division, according to the EMBL database divisions.
sequencelength: the total number of bases in the sequence.
3.3.1.3. Features Table
The feature table gives information about the tags in the insert sites.
Example:
FT misc_binding 486..519
FT /bound_moiety ="CTF/NF1"
FT /label="NF1_3_00002_1"
FT /note="Base quality score is 3.2218e-08"
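The feature lines in the example can be read with a short Python sketch such as the one below. It is a sketch under stated assumptions, not a full EMBL feature table parser: the function name parse_features is hypothetical, only simple start..end locations are handled, and each qualifier is assumed to fit on a single line.

```python
from typing import Any, Dict, Iterable, List


def parse_features(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Collect FT lines into features with a key, a location and qualifiers."""
    features: List[Dict[str, Any]] = []
    for line in lines:
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if body.startswith("/"):
            if not features:
                continue  # qualifier without a preceding feature line
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][name.strip()] = value.strip().strip('"')
        elif body:
            key, location = body.split(None, 1)
            start, end = (int(x) for x in location.split(".."))
            features.append({"key": key, "start": start, "end": end, "qualifiers": {}})
    return features


EXAMPLE = [
    'FT misc_binding 486..519',
    'FT /bound_moiety ="CTF/NF1"',
    'FT /label="NF1_3_00002_1"',
    'FT /note="Base quality score is 3.2218e-08"',
]

for feature in parse_features(EXAMPLE):
    print(feature["key"], feature["start"], feature["end"], feature["qualifiers"]["label"])
```

The /label qualifier carries the tag identifier, so it can be used to link a feature back to the corresponding tag sequence entry.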
3.3.2. Tag sequences
REFERENCES
APPENDIX A: SURVEY OF CURRENT RELEASE