Let's have a look at a few examples.
A)
A sizeable fraction of eukaryotic promoters contain a so-called TATA-box upstream of the initiation site. Let's suppose that we already know the approximate structure of this element and that we are primarily interested in its location relative to the initiation sites. To answer this question go to the OProf page (OProf stands for occurrence profile) and follow the instructions below:
As you can see, there are a number of parameters you can play with in order to make a graphically appealing signal occurrence profile.
B)
Let's now look at bacterial translation initiation sites.
To analyse the Shine-Dalgarno motif,
choose an oligonucleotide near the 3' end of E. coli ribosomal
RNA,
e.g.
CCUCCU, and produce a signal occurrence profile
for its reverse complement (AGGAGG in the proposed example) for
translation start site regions of several bacterial species.
Note that you can select several species at once (up to four)
in order to combine several signal occurrence profiles in
one graph. To start this analysis, we propose to analyse
and compare the Shine-Dalgarno interaction regions of the
extensively studied species E. coli and
B. subtilis.
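To make the idea of a signal occurrence profile concrete, here is a minimal Python sketch. It is not the OProf implementation: the function name occurrence_profile, its parameters, and the choice to report the fraction of sequences with at least one match per window are all assumptions made for illustration. The sequences are assumed to be aligned so that the functional site sits at the same index in each of them.

```python
from typing import List, Tuple


def occurrence_profile(seqs: List[str], motif: str, site_index: int,
                       window: int = 20, step: int = 1) -> List[Tuple[int, float]]:
    """Return (relative_position, fraction) pairs for a sliding window.

    'fraction' is the share of sequences containing the motif at least
    once inside the window. Positions are window centres, reported
    relative to site_index (0-based index of the functional site).
    Hypothetical helper, not part of the SSA server.
    """
    motif = motif.upper()
    length = min(len(s) for s in seqs)
    profile = []
    for start in range(0, length - window + 1, step):
        # Count the sequences in which the window contains the motif.
        hits = sum(motif in s[start:start + window].upper() for s in seqs)
        centre = start + window // 2 - site_index
        profile.append((centre, hits / len(seqs)))
    return profile


# Two toy "translation start regions" with the start codon at index 14.
toy = ["TTTTAGGAGGTTTTATGAAA", "CCCCAGGAGGCCCCATGCCC"]
profile = occurrence_profile(toy, "AGGAGG", site_index=14, window=6)
```

Plotting the fraction against the relative position gives a curve with a peak where the motif is preferentially located.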
Bacteria use not only ATG as a translation initiation codon,
but also, at lower frequencies, GTG, CTG, and TTG. Determine the
frequencies at which these codons are used in various prokaryotic
species using the OProf service.
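If you have downloaded the translation start regions as sequences, the same question can be checked offline. The following Python sketch is only an illustration: the function name start_codon_frequencies and the assumption that the first base of the start codon sits at the same 0-based index in every sequence are not part of the OProf service.

```python
from collections import Counter
from typing import Dict, List


def start_codon_frequencies(seqs: List[str], site_index: int) -> Dict[str, float]:
    """Relative frequency of each triplet found at the start position.

    site_index is the 0-based index of the first base of the start
    codon (an assumption about how the sequences were extracted).
    Sequences too short to contain the codon are skipped.
    """
    codons = Counter(
        s[site_index:site_index + 3].upper()
        for s in seqs
        if len(s) >= site_index + 3
    )
    total = sum(codons.values())
    return {codon: n / total for codon, n in codons.items()}


# Toy input: the start codon begins at index 2 in every sequence.
freqs = start_codon_frequencies(["AAATGC", "AAGTGC", "AAATGA", "AATTGA"], 2)
```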
2. FPS-dependent sequence retrieval
As mentioned before, SSA programs typically do not
use sequences as input but lists of computer-readable
pointers to sequence positions in a database.
Such a list of pointers is called a functional position set,
or FPS. Each pointer contains a sequence id, a position, and
two flags, one indicating the strand (+ or -), the
other one the topology (1=linear, 0=circular). The Eukaryotic
Promoter Database is an example of a functional position set. To further
illustrate this concept, let's now go for a short moment
to the EPD pages:
http://www.epd.isb-sib.ch/
Display an individual promoter entry in text format, for instance
http://www.epd.isb-sib.ch/cgi-bin/get_doc?db=epd&format=text&entry=HS_MYC_1
The computer-readable pointer to the sequence position is contained in the line starting with the line code FP.
http://www.isrec.isb-sib.ch/ssa/data/fps/pro/epd_nr.fps
http://www.isrec.isb-sib.ch/ssa/data/fps/bac/Bacillus_subtilis.fps
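The four components of a pointer described above (sequence id, position, strand flag, topology flag) can be modelled in a few lines of Python. This is a sketch for illustration only: the class SitePointer, its field names, and the helper region_borders are hypothetical, and the code does not parse the actual FP line format. It shows why the strand flag matters when a region around the site is extracted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SitePointer:
    """One entry of a functional position set (hypothetical field names)."""
    seq_id: str    # identifier of the database sequence
    position: int  # coordinate of the functional site in that sequence
    strand: str    # '+' or '-'
    linear: bool   # True for linear, False for circular topology


def region_borders(p: SitePointer, upstream: int, downstream: int) -> Tuple[int, int]:
    """Database coordinates of the 5' and 3' borders of a region.

    On the '-' strand, upstream of the site means higher coordinates.
    Wrap-around on circular sequences is not handled in this sketch.
    """
    if p.strand == "+":
        return p.position - upstream, p.position + downstream
    return p.position + upstream, p.position - downstream


plus_site = SitePointer("SEQ1", 1000, "+", True)
minus_site = SitePointer("SEQ1", 1000, "-", True)
```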
Now, from the EPD home page, follow the link: Download promoter sequences. This page allows you to extract promoter sequence segments around transcription initiation sites from EPD, and from various predefined subsets of EPD. The user can specify the relative 5' and 3' borders of the sequence regions to extract. Use this page to download promoter sequence files in Fasta format corresponding to the following subsets.
All promoters
Vertebrate promoters
Plant promoters
Arthropod promoters
Activate the switch "Representative set of not closely related sequences" at the bottom of the page. Specify the sequence region -499 to +100. Note that the first transcribed base of the RNA is numbered 0, so the total length of the extracted sequence fragments is exactly 600. Save the result in text format.
These files can be uploaded to the signal search analysis server. Try to reproduce one of the signal occurrence profiles you made before by uploading the promoter sequence file containing the non-redundant subset of all promoter sequences. Note that you have to indicate the relative internal position of the functional site on the OProf form (500 in this case). You can also specify a name for the sequence set (e.g. epd_nr) and a description of the site type (e.g. "Transcription start site") on the form. The contents of these fields will appear in the graphical output produced by the signal search analysis server.
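The numbers 600 and 500 follow directly from the numbering convention in which the first transcribed base is position 0. A small Python check, with hypothetical function names, makes the arithmetic explicit:

```python
def region_length(five_prime: int, three_prime: int) -> int:
    """Number of bases from five_prime to three_prime, both inclusive.

    Assumes a numbering scheme that includes position 0.
    """
    return three_prime - five_prime + 1


def internal_site_position(five_prime: int) -> int:
    """1-based index, within the extracted fragment, of the base numbered 0."""
    return 1 - five_prime


length = region_length(-499, 100)       # bases -499 .. +100
site = internal_site_position(-499)     # where position 0 falls in the fragment
```

With borders -499 and +100 there are 499 bases upstream of the site, the site itself, and 100 bases downstream, so the site is the 500th base of a 600-base fragment.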
The input to a constraint analysis is a functional position set (FPS) and a so-called "signal sequence collection". The latter may consist of a complete set of oligonucleotides of a particular length. As in OProf, the sequences extracted with the FPS are scanned with a sliding window. The frequencies of the elements of the signal sequence collection are determined for each window. This gives rise to a two-dimensional array of numbers called "signal search data". In windows with high sequence constraints, a few oligonucleotides may occur at very high frequencies while most others occur at frequencies slightly below expectation. This leads to a relatively high variance of the "signal frequencies" (original jargon). The constraint index displayed in a constraint profile is based on the variance of the signal frequencies.
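The following Python sketch illustrates the principle under simplifying assumptions. It builds the two-dimensional array of counts (windows by oligonucleotides) and then takes the plain population variance of each row. The function names are hypothetical, and the exact constraint index used by the server is not specified here, so the variance below is only a stand-in for it.

```python
from collections import Counter
from itertools import product
from statistics import pvariance
from typing import List, Tuple


def signal_search_data(seqs: List[str], k: int, window: int,
                       step: int) -> Tuple[List[str], List[List[int]]]:
    """Count every k-mer in every window position.

    Returns the list of k-mers (columns) and one row of counts per
    window. Sequences are assumed to be aligned on the functional site.
    """
    oligos = ["".join(p) for p in product("ACGT", repeat=k)]
    length = min(len(s) for s in seqs)
    table = []
    for start in range(0, length - window + 1, step):
        counts = Counter()
        for s in seqs:
            w = s[start:start + window].upper()
            for i in range(len(w) - k + 1):
                counts[w[i:i + k]] += 1
        # Counter returns 0 for k-mers that never occurred.
        table.append([counts[o] for o in oligos])
    return oligos, table


def variance_profile(table: List[List[int]]) -> List[float]:
    """Variance of the k-mer counts in each window (stand-in index)."""
    return [pvariance(row) for row in table]


# Toy data: the first four bases are identical, the rest is diverse.
toy = ["TATAACGT", "TATAGGCA", "TATACTAG"]
oligos, table = signal_search_data(toy, k=2, window=4, step=4)
profile = variance_profile(table)
```

In the toy data the first window is dominated by two dinucleotides, so its variance is much higher than that of the second, unconstrained window.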
Let's look at an example:
XNX -> ANA,ANC,ANG,ANT,CNA,CNC,CNG,CNT,GNA,GNC,GNG,GNT,TNA,TNC,TNG,TNT
Different types can be combined in one collection, but they all have to be of the same length.
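The expansion shown above can be reproduced with a few lines of Python. This is a sketch with hypothetical function names, assuming that X stands for each of the four bases in turn and that every other character (such as N) is kept as it is:

```python
from itertools import product
from typing import List


def expand_type(pattern: str, bases: str = "ACGT") -> List[str]:
    """Expand every X in the pattern to each base; keep other characters."""
    slots = [bases if ch == "X" else ch for ch in pattern]
    return ["".join(p) for p in product(*slots)]


def build_collection(patterns: List[str]) -> List[str]:
    """Combine several types into one collection of equal-length elements."""
    if len({len(p) for p in patterns}) != 1:
        raise ValueError("all types must have the same length")
    return [s for p in patterns for s in expand_type(p)]


xnx = expand_type("XNX")                     # 16 elements, ANA .. TNT
combined = build_collection(["XNX", "XXN"])  # two types of length 3
```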
Further suggestion:
Use Slist to further investigate the signals corresponding to the constraint regions found in eukaryotic promoters and bacterial translation start regions.
0 0 0 1
1 0 0 0
0 0 0 1
1 0 0 0
1 0 0 0
1 0 0 0
Cut-off value: 5
Convince yourself of this equivalence by generating the same signal occurrence profile with this motif for eukaryotic promoters, once with a consensus sequence and once with a weight matrix.
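The equivalence can also be verified by brute force. The Python sketch below reads the 24 matrix values above as six positions (rows) with four weights each, in the base order A, C, G, T. That reading is an assumption, chosen because it spells out TATAAA; the function names are hypothetical. The code then checks that a score of at least 5 selects exactly the hexamers with at most one mismatch to TATAAA.

```python
from itertools import product

BASES = "ACGT"  # assumed column order of the matrix

# One row per motif position, one weight per base (A, C, G, T).
MATRIX = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
CUTOFF = 5


def matrix_match(word: str) -> bool:
    """True if the summed weights reach the cut-off value."""
    score = sum(MATRIX[i][BASES.index(b)] for i, b in enumerate(word))
    return score >= CUTOFF


def consensus_match(word: str, consensus: str = "TATAAA",
                    max_mismatch: int = 1) -> bool:
    """True if the word differs from the consensus in at most max_mismatch positions."""
    return sum(a != b for a, b in zip(word, consensus)) <= max_mismatch


all_words = ["".join(p) for p in product(BASES, repeat=6)]
agree = all(matrix_match(w) == consensus_match(w) for w in all_words)
n_hits = sum(matrix_match(w) for w in all_words)
```

Both descriptions accept the same hexamers: TATAAA itself plus the 18 words that differ from it at a single position.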
It is not a trivial task to find an optimal weight matrix description for a motif like the TATA-box. The program PatOp (for pattern optimization) implements an iterative procedure which successively optimizes the weight matrix, the cut-off value, and the borders of the preferred region of occurrence, keeping two of these three components constant at a time. PatOp has the capability of extending the matrix to the left and right side if additional consensus is observed, or of dropping positions in the opposite case.
Use this program to produce a weight matrix description of the TATA-box motif for the non-redundant insect and plant promoter sets (they are relatively small and thus do not take too much time). Use default parameters for this purpose (a detailed understanding of the parameters of the PatOp algorithm is beyond the scope of this tutorial). Start from the consensus sequence motif TATAAA (one mismatch).
PatOp uses a heuristic algorithm converging to a local optimum. To test convergence, start the iterative refinement process from another initial motif, for instance TATAAT.
Also try to derive a weight matrix for the Shine-Dalgarno interaction region of a completely sequenced bacterium. Are the weights of the matrix found to be compatible with the assumption that G:U pairs can also be formed between the mRNA leader and the 3' end of the 16S rRNA?
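To know what to look for in the matrix, it helps to list which mRNA hexamers could pair with the rRNA oligonucleotide CCUCCU when G:U pairs are allowed. The Python sketch below enumerates them; the function names are hypothetical, and the code considers only perfect antiparallel pairing over all six positions, which is a simplification. Sequences are written in the RNA alphabet, so read U as T when comparing with DNA sequences.

```python
from itertools import product
from typing import List

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pairs_with(mrna: str, rrna: str, allow_gu: bool) -> bool:
    """True if mrna and rrna (both written 5'->3') pair at every position.

    Pairing is antiparallel, so the mRNA is compared with the reversed rRNA.
    """
    allowed = (WATSON_CRICK | WOBBLE) if allow_gu else WATSON_CRICK
    return len(mrna) == len(rrna) and all(
        (m, r) in allowed for m, r in zip(mrna, reversed(rrna))
    )


def binding_words(rrna: str = "CCUCCU", allow_gu: bool = False) -> List[str]:
    """All mRNA words of the same length that pair with the rRNA oligonucleotide."""
    words = ("".join(w) for w in product("ACGU", repeat=len(rrna)))
    return [w for w in words if pairs_with(w, rrna, allow_gu)]


strict = binding_words(allow_gu=False)
relaxed = binding_words(allow_gu=True)
```

With Watson-Crick pairs only, AGGAGG is the single solution. Allowing G:U pairs adds the words in which one or both A positions are replaced by G, so a matrix compatible with G:U pairing would be expected to give G a positive weight at those positions.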
A sequence set of yeast splice acceptor sites can be found here:
http://www.isrec.isb-sib.ch/ssa/data/yeast_ag.seq
The sequences extend from -200 to +100 relative to the 3' end of the intron. Use the skills you learned during the previous exercises to characterize the branchpoint consensus sequence of budding yeast.