Description of MADAP

MADAP is a flexible clustering tool for the interpretation of one-dimensional genome annotation data mapped onto complete or partial genome sequences. Such data might consist of counts, probabilities, or intensities, and might be obtained from cDNA and tag sequencing protocols that map the 5' and 3' ends of mRNA, from ChIP-chip analysis, or from genome-wide SNP typing used in genotype-phenotype association studies. MADAP identifies groups of data corresponding to one or several genomic sites, and estimates the volume and extension of such groups (clusters).

Input: set of integer numbers, typically representing genomic positions. These numbers can occur multiple times, according to the strength of a measure at a given genomic site. The MADAP web-server accepts data in different formats, including gff files (see below).

Processing: MADAP models one-dimensional input data by a set of clusters defined by center and range, which have sufficient support from the data and which are compatible with constraints defined by the user.

Output: clustering results are provided in graphical and tabular form, accompanied by descriptions of the iteration steps of the algorithm.

Source code: MADAP is also available as a stand-alone program via ftp. A more detailed description is available here.

Publication: Schmid CD, Sengstag T, Bucher P and Delorenzi M. (2007) MADAP, a flexible clustering tool for the interpretation of one-dimensional genome annotation data. Nucleic Acids Res, 35, W201-205.

MADAP submission form


Data input:
upload (Demo file) :
view Demo file
upload (Demo gff file) :
view Demo gff file
Model initialisation parameters
     Minimal number of clusters : (-m, kminnbcomponents)
     Maximal number of clusters : (-M, kmaxnbcomponents)
     Integration range : (-c, kdefaultfusionsdist)
     Background subtraction : (-s, kminnbpoints)
Model constraints parameters
     Min. distance btw peaks of clusters : (-p, kminimalpeakdist)
     Min. points in cluster : (-n, kminnbdatapoints)
Model fitting parameters
     Standard deviation : (-d, kfixvar)
     Adapting standard deviation :
     Probability of error data : (-e, kerrorprior)
Output options
     Mode for highest likelihood computation : classical mixture model likelihood, or full attribution of points to clusters
     1st extended reporting range : (-u, krefdist1)
     2nd extended reporting range : (-w, krefdist2)
Options graphical output
     X axis label:
     Y axis label:
     Box around cluster in global plot:

MADAP parameter explanations

upload file format :

Numbers representing genomic positions, separated by any whitespace (space, tab, ...).
These genomic positions might represent mapping data from large-scale sequencing of mRNA 3' and 5' ends, or from ChIP-chip analysis.
Each position number occurs at a frequency corresponding to the strength (or intensity) of a measure at that position.
MADAP accepts at most 500'000 data points at no more than 100'000 different positions.
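
As an illustration of the upload format, the following minimal sketch converts a table of counts per position into the repeated-position list described above and checks the stated size limits. The function names and the example values are hypothetical and are not part of MADAP.

```python
from collections import Counter

MAX_POINTS = 500_000      # limit on total data points stated above
MAX_POSITIONS = 100_000   # limit on different positions stated above


def expand_counts(counts):
    """Turn {position: count} into a tab-separated string of repeated positions."""
    total = sum(counts.values())
    if total > MAX_POINTS:
        raise ValueError(f"{total} data points exceed the limit of {MAX_POINTS}")
    if len(counts) > MAX_POSITIONS:
        raise ValueError(f"{len(counts)} positions exceed the limit of {MAX_POSITIONS}")
    tokens = []
    for pos in sorted(counts):  # sorted only for readability (assumption)
        tokens.extend([str(pos)] * counts[pos])
    return "\t".join(tokens)


def read_positions(text):
    """Parse whitespace-separated positions back into counts per position."""
    return Counter(int(tok) for tok in text.split())


# Hypothetical example: three measurements at 1200, one at 1205.
example = expand_counts({1200: 3, 1205: 1})
print(example)
```

The round trip through `read_positions` is only a convenience for checking a file before upload.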

Demo file :

contains tab-separated genomic positions of 5' ends of full-length transcripts of a human gene. Each position occurs at a frequency corresponding to the number of full-length transcripts starting at this position.
Suggested parameters for MADAP on this data set: -m 1 -M 16 -c 5 -s 0 -p 50 -n 200 -d 20 -e 0.02 -u 6 -w 11

gff format :

http://www.sequenceontology.org/gff3.shtml
The strength (or intensity) of a measure is represented by an integer number (<1000) in column 6.
The 5' end of the described feature is used as the position:
-> column 4 in case of positive transcriptional orientation ('+' in column 7),
-> column 5 in case of negative orientation ('-' in column 7).
MADAP accepts at most 500'000 data points at no more than 100'000 different positions.
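
The column rules above can be sketched as follows. This is not the server's parser: the function name, the handling of comment lines, and the decision to skip features without an integer score or without a '+' or '-' strand are assumptions made for the example.

```python
def gff_to_points(lines):
    """Yield (position, strength) for each usable GFF feature line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF directives/comments
        cols = line.split("\t")
        if len(cols) < 7 or not cols[5].isdigit():
            continue  # assumption: ignore lines without an integer score
        start, end, strand = int(cols[3]), int(cols[4]), cols[6]
        if strand == "+":
            position = start  # 5' end is column 4 on the plus strand
        elif strand == "-":
            position = end    # 5' end is column 5 on the minus strand
        else:
            continue          # assumption: unstranded features are skipped
        yield position, int(cols[5])


# Hypothetical feature lines, not taken from the demo file.
demo = [
    "##gff-version 3",
    "chr12\tdemo\tprobe\t1000\t1050\t12\t+\t.\tID=p1",
    "chr12\tdemo\tprobe\t2000\t2050\t7\t-\t.\tID=p2",
]
points = list(gff_to_points(demo))
print(points)
```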

Demo gff file :

contains (remapped) genomic positions of Nimblegen probes of a ChIP-chip experiment (GEO GSE2672) on parts of human chromosome 12. The intensity of the hybridization signal has been transformed into frequency scores using int(10**(ChIP log-ratio)) - 1, with an arbitrary maximum of 198.
Suggested parameters for MADAP on this data set: -m 7 -M 20 -c 500 -s 0 -p 1000 -n 20 -d 300 -e 0.002 -u 600 -w 1500
For comparison, see the genome annotations for the same region in ENSEMBL.
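
The score transformation used for the demo gff file can be written as a small function. The formula and the maximum of 198 come from the description above; the function name and the clamping of negative results to 0 are assumptions added for this sketch.

```python
MAX_SCORE = 198  # arbitrary maximum stated for the demo data


def log_ratio_to_frequency(log_ratio):
    """Convert a ChIP log-ratio into an integer frequency score."""
    score = int(10 ** log_ratio) - 1
    # Assumption: results below 0 (log-ratio < 0) are clamped to 0.
    return max(0, min(score, MAX_SCORE))


for lr in (-1.0, 0.0, 1.0, 2.0, 3.0):
    print(lr, log_ratio_to_frequency(lr))
```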

Minimal number of clusters:

Model fitting uses at least this number of clusters (> 0). The minimal number of clusters has to be smaller than or equal to the maximal number of clusters.

Maximal number of clusters:

Model initialization uses up to this number of clusters, additionally limited by the number of distinct positions in the data.
On the web server, this parameter is limited to 50. Users of the MADAP web server are advised to split their input data into smaller portions if more than ~45 clusters are expected in the initial set.

Integration range:

Initial estimate for a distance within which points can be assumed to belong to the same cluster.
Suggested value: half-width of the smallest cluster you might expect.

Background subtraction:

The frequencies at all positions are reduced by the background count given by this parameter. This subtraction affects only the definition of initial centers during model initialization; the EM steps and the likelihood are calculated with the full data.
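
A minimal sketch of this idea, assuming that positions whose count drops to zero or below are simply ignored when initial centers are chosen (MADAP's exact handling is not described here). The function name and data are hypothetical.

```python
from collections import Counter


def subtract_background(counts, background):
    """Reduce every count by `background`; keep only positions still above zero."""
    return {pos: n - background for pos, n in counts.items() if n - background > 0}


# Hypothetical data: the full list is what the fitting would still use.
data = [100, 100, 100, 101, 105, 105, 300]
counts = Counter(data)

# Counts used only for picking initial centers, with a background of 1.
init_counts = subtract_background(counts, 1)
print(init_counts)
print(len(data), "points remain available for fitting")
```

With a background of 0, as in the suggested demo parameters, the initialization sees the unmodified counts.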

Minimal distance btw peaks of clusters:

When model fitting produces two clusters whose representative peaks are closer than this distance, the cluster that contains fewer points is eliminated and the model fitting is repeated with the number of clusters reduced by one.

Minimal points in cluster:

When model fitting produces a cluster that contains fewer than this number of points, this cluster is eliminated and the model fitting is repeated with the number of clusters reduced by one.
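
The two constraints can be sketched as a check that names one cluster to eliminate per round. The data layout, the function name, the order in which the constraints are tested, and the tie-breaking rule are assumptions for illustration; the refitting step itself is not shown.

```python
def cluster_to_eliminate(clusters, min_peak_dist, min_points):
    """Return the index of one cluster violating a constraint, or None."""
    # Constraint 1: a cluster with fewer than min_points points.
    for i, c in enumerate(clusters):
        if c["n_points"] < min_points:
            return i
    # Constraint 2: neighbouring peaks closer than min_peak_dist;
    # the cluster with fewer points goes (ties drop the right-hand one).
    order = sorted(range(len(clusters)), key=lambda i: clusters[i]["peak"])
    for a, b in zip(order, order[1:]):
        if clusters[b]["peak"] - clusters[a]["peak"] < min_peak_dist:
            return a if clusters[a]["n_points"] < clusters[b]["n_points"] else b
    return None


# Hypothetical fitted clusters, checked with -p 50 and -n 200.
remaining = [
    {"peak": 100, "n_points": 500},
    {"peak": 130, "n_points": 250},
    {"peak": 400, "n_points": 50},
]
eliminated = []
while (idx := cluster_to_eliminate(remaining, 50, 200)) is not None:
    eliminated.append(remaining.pop(idx)["peak"])
    # A real run would refit the model with one cluster fewer here.

print(eliminated, [c["peak"] for c in remaining])
```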

Standard deviation:

Value for a fixed standard deviation of individual clusters used in model fitting.

Adapting standard deviation:

The standard deviation of individual clusters is adapted to the data during model fitting. Selecting this option means that the value entered above for the standard deviation (parameter -d) is ignored.

Probability of error data:

This value is used in the model fitting as an estimate of the proportion of data points that belong to a random background distribution, that is, points that do not belong to any of the clusters included in the model.
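
To illustrate the role of such a background proportion, the sketch below computes the posterior probability that a point belongs to the background rather than to a cluster. It assumes Gaussian clusters and a uniform background over a region of length `span`; both choices, and all names, are assumptions for the example and may differ from MADAP's internal model.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def background_posterior(x, clusters, error_prior, span):
    """Probability that x is background, given clusters as (weight, mean, sd).

    Cluster weights are expected to sum to 1; the background is assumed
    uniform over a region of length `span`.
    """
    background = error_prior / span
    signal = (1.0 - error_prior) * sum(
        w * gaussian_pdf(x, m, sd) for w, m, sd in clusters
    )
    return background / (background + signal)


# Hypothetical single cluster at 1000 with sd 20, error prior 0.02.
clusters = [(1.0, 1000.0, 20.0)]
near = background_posterior(1000, clusters, 0.02, 10_000)
far = background_posterior(1200, clusters, 0.02, 10_000)
print(near, far)
```

Under these assumptions a point at the cluster center is almost certainly signal, while a point ten standard deviations away is attributed to the background.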

Mode for highest likelihood computation:

Choose which likelihood mode selects the model shown in the graphical display: the model with the highest likelihood under the chosen mode is plotted. The text output reports the results of both modes, which frequently coincide.
A more detailed description is available here.
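
The difference between the two modes can be illustrated with a common textbook formulation: a mixture likelihood sums over all clusters for each point, whereas a full-attribution likelihood counts only the best-fitting cluster per point. This is a sketch under that assumption, with Gaussian clusters and no background term; it is not taken from the MADAP source, and all names are hypothetical.

```python
import math


def log_gaussian(x, mean, sd):
    """Log-density of a normal distribution."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(points, clusters):
    """Sum over points of log(sum over clusters of weight * density)."""
    return sum(
        math.log(sum(w * math.exp(log_gaussian(x, m, sd)) for w, m, sd in clusters))
        for x in points
    )


def attribution_loglik(points, clusters):
    """Sum over points of the best single-cluster log(weight * density)."""
    return sum(
        max(math.log(w) + log_gaussian(x, m, sd) for w, m, sd in clusters)
        for x in points
    )


# Well separated clusters: both values agree closely.
separated = [(0.5, 100.0, 5.0), (0.5, 200.0, 5.0)]
pts_a = [95, 100, 104, 198, 200, 205]
gap_separated = mixture_loglik(pts_a, separated) - attribution_loglik(pts_a, separated)

# Overlapping clusters: the mixture value is clearly higher.
overlapping = [(0.5, 100.0, 5.0), (0.5, 104.0, 5.0)]
pts_b = [100, 102, 104]
gap_overlapping = mixture_loglik(pts_b, overlapping) - attribution_loglik(pts_b, overlapping)

print(gap_separated, gap_overlapping)
```

In this formulation the two values differ noticeably only when clusters overlap, which is consistent with the remark that the two modes frequently coincide.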

1st extended reporting range:

This value specifies a distance from the peak of each cluster; the number and the fraction of points within this distance are reported in the text output. This option does not influence the clustering, but might help to identify cluster definitions that are too narrow.

2nd extended reporting range:

This value specifies a second distance from the peak of each cluster; the number and the fraction of points within this distance are reported in the text output. This option does not influence the clustering, but might help to identify cluster definitions that are too narrow.
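
A minimal sketch of such a report, assuming the distance is inclusive and the fraction is taken relative to all data points (the text output may define the fraction differently). The function name and the data are hypothetical.

```python
def points_within(points, peak, distance):
    """Return (count, fraction of all points) lying within `distance` of `peak`."""
    n = sum(1 for x in points if abs(x - peak) <= distance)
    return n, n / len(points)


# Hypothetical data around a peak at 100, reported with -u 6 and -w 11.
points = [90, 95, 98, 100, 100, 103, 106, 111, 150, 400]
first_range = points_within(points, 100, 6)
second_range = points_within(points, 100, 11)
print(first_range, second_range)
```

A large jump in the count between the two ranges suggests that the fitted cluster is narrower than the data around its peak.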

X axis label:

Label to be used on X axis in final plots. This is typically the genomic position.

Y axis label:

Label to be used on Y axis in final plots. This is typically the frequency of transcription start sites (TSS).

Box around cluster:

Check this box to add a rectangle (bounded by plus/minus one standard deviation) around each identified cluster.