Lectures
WCMD 2008: Course: "Using DNA Microarrays for diagnosis and prognosis.
The machine learning story"
Goals
The main goal of this course is to introduce the audience to a number of
different techniques used in analyzing microarray data. The data analysis side
is emphasized, but a brief overview of the technological basis is also given.
Slides
WCMD 2008: Tutorial session "Statistical data exploration and analysis with R"
Goals
This tutorial is designed to serve as an accelerated introduction to data analysis
in R. It targets a technical audience, with prior programming experience and with
at least a basic level in applied statistics. The goal of the tutorial is to show
how one can use R to perform a number of different basic data analyses and to
serve as a starting point for further investigation. It will also help those attending
the series of lectures on microarray data analysis.
Slides
Here are the slides. Those wishing to print the
slides might prefer a different layout: 2 slides per
page or 4 slides per page.
SIB/EMBnet: Statistical analysis applied to genomic and proteome analyses
Slides
Here are the slides from the presentation. If you want to print
them out, you may find it more convenient to use either the 4 slides per page
layout or the 2 slides per page layout, which leaves you some space for
notes.
Exercise session (~120 mins.)
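- Load the data (MDA dataset) and get familiar with it:
# expression matrix: row names in the first column, then 130 numeric columns
X = as.matrix(read.table("expression.txt", header=TRUE,
colClasses=c("character", rep("numeric",130)), row.names=1))
# probe annotation (read.delim() already returns a data frame)
A = read.delim("probe_annotation.txt")
# clinical data: row names in the first column, then 3 numeric columns
Z = as.matrix(read.table("clinical.txt", header=TRUE,
colClasses=c("character", rep("numeric",3)), row.names=1))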
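Before going further, it may help to check that the samples are aligned between the expression and the clinical data. The sketch below uses small toy stand-ins for X and Z and assumes that X holds probesets in rows and samples in columns, which is suggested by the 130 numeric columns but not stated explicitly.

```r
# Toy stand-ins for X (probesets x samples) and Z (samples x clinical variables);
# in the exercise these come from expression.txt and clinical.txt.
set.seed(1)
samples <- paste0("S", 1:6)
X <- matrix(rnorm(4 * 6), nrow = 4,
            dimnames = list(paste0("probe", 1:4), samples))
Z <- matrix(c(1, 0, 1, 1, 0, 0), ncol = 1,
            dimnames = list(samples, "ER.status"))

dim(X)                     # number of probesets and number of samples
table(Z[, "ER.status"])    # class balance of the outcome
# The samples must appear in the same order in both objects.
aligned <- identical(colnames(X), rownames(Z))
aligned
```

The same three checks can be run unchanged on the real X and Z once they are loaded.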
- Check the perf.r and kcv.r files
and try to understand the parameters, the returned values and how CV should be used.
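As a rough illustration of what a stratified k-fold split does, here is a minimal sketch. The function stratified.folds() is hypothetical and is not the code in kcv.r; it only shows the idea of dealing the samples of each class out to the folds separately.

```r
# Assign each sample to one of k folds, separately within each class,
# so that every fold has roughly the same class proportions.
stratified.folds <- function(y, k) {
  folds <- integer(length(y))
  for (cl in unique(y)) {
    idx <- which(y == cl)
    # shuffle the members of this class, then deal them out to folds 1..k in turn
    idx <- idx[sample.int(length(idx))]
    folds[idx] <- rep(seq_len(k), length.out = length(idx))
  }
  folds
}

set.seed(42)
y <- rep(c(0, 1), times = c(20, 10))   # toy labels: 20 negatives, 10 positives
f <- stratified.folds(y, k = 5)
table(fold = f, class = y)             # 4 negatives and 2 positives in every fold
```

In each round, the samples with f == i form the test set and all the others form the training set.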
- Check the fselect.r file for a simple implementation
(taken from the SMA package) of a feature selection strategy.
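The name bwss() suggests a ranking by the ratio of between-class to within-class sums of squares. The sketch below is written under that assumption; bw.ratio() is a hypothetical helper, not the code in fselect.r, and it expects probesets in rows and samples in columns.

```r
# Between-to-within sum-of-squares ratio for each row (probeset) of x;
# y holds the class label of each column (sample).
bw.ratio <- function(x, y) {
  overall <- rowMeans(x)
  bss <- numeric(nrow(x))
  wss <- numeric(nrow(x))
  for (cl in unique(y)) {
    xc <- x[, y == cl, drop = FALSE]
    mc <- rowMeans(xc)                       # class mean of every probeset
    bss <- bss + ncol(xc) * (mc - overall)^2
    wss <- wss + rowSums((xc - mc)^2)
  }
  bss / wss
}

set.seed(1)
y <- rep(c(0, 1), each = 10)
x <- matrix(rnorm(50 * 20), nrow = 50)       # 50 toy probesets, 20 samples
x[1:3, y == 1] <- x[1:3, y == 1] + 5         # make the first three strongly informative
top <- order(bw.ratio(x, y), decreasing = TRUE)[1:3]
sort(top)                                    # the three shifted rows should rank on top
```

Probesets with a large ratio separate the classes well relative to their within-class spread, which is why the top-ranked ones are kept.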
- Estimate the error rate, sensitivity, specificity and AUC for an LDA
classifier of ER status, built using the top 3 probesets selected with bwss().
You can check the mda-ex01.r file. Experiment with different settings
for the cross-validation. Plot the histogram of the error rates from repeated
CVs. Compare the "textbook" confidence intervals with the empirical quantiles.
Experiment with different numbers of probesets.
Examples:
r = do.training(Z[,"ER.status"], X, kfold=5, rep=1, ntop=3)
r = do.training(Z[,"ER.status"], X, kfold=3, rep=10, ntop=3)
boxplot(data.frame(Error=r$err, Sn=r$sens, Sp=r$spec, AUC=r$auc))
quantile(r$err, probs=c(0.025,0.975))
2.5% 97.5%
0.04940476 0.07142857
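One way to carry out the comparison is sketched below, taking the "textbook" interval to mean the normal-approximation binomial interval (an assumption). The err vector is made-up toy data standing in for r$err, and n = 130 is assumed from the number of data columns in expression.txt.

```r
# Toy error rates standing in for r$err from repeated cross-validation
err <- c(0.05, 0.06, 0.07, 0.05, 0.06, 0.06, 0.07, 0.05, 0.06, 0.07)
n <- 130                      # number of samples (assumed)

p <- mean(err)
se <- sqrt(p * (1 - p) / n)   # binomial standard error of an error rate
z <- qnorm(0.975)
textbook <- c(lower = p - z * se, upper = p + z * se)
empirical <- quantile(err, probs = c(0.025, 0.975))
textbook
empirical
```

The two intervals answer different questions: the binomial one reflects the finite number of test samples, while the empirical quantiles only reflect the variability between repetitions of the cross-validation on the same data.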
- [30 mins for a quiz session]
Estimate the performance of a classifier trained to predict the pathologic complete
response (see points 1-6 below). Please gather all the answers in a document and
email it (at the end of the session) to vlad.popovici@isb-sib.ch. Do not forget to
include your name in the document.
- Estimate the performance of an LDA trained on the top 15 probesets, using repeated
(20 times) stratified 5-fold cross-validation. (While waiting for the results,
prepare the commands for points 3-4.)
- Report the mean and standard deviation of the error rate, AUC, sensitivity
and specificity.
- Build a new training set X1 by selecting the top 15 probesets (using bwss())
from the full dataset. Hint: look in the mda-ex01.r file for the lines
corresponding to feature selection and do the same on the full matrix X.
- Optimistically biased estimates of the performance: Estimate the (biased)
performance of an LDA trained on X1 to predict the pathologic complete response,
using the same scheme as for point 1.
- Plot a boxplot of the correctly estimated error rate and AUC along with the
biased estimates of the same quantities. Comment on the results.
- Train an LDA on the X1 data and write a function for drawing the ROC. Hint:
use mda-ex01.r for an example of how to build the LDA model and how to predict
the posterior probabilities.
- (if time allows)
Using the nested CV scheme, estimate the performance of an LDA with the optimal number of
probesets selected from the {5,10,15,20,25} set.
See the mda-ex02.r file and use it as a template.
Finally, build the optimal classifier on the full dataset.
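The nested scheme can be sketched as follows. This is not the code in mda-ex02.r: the helpers rank.features(), fit.and.test() and nested.cv() are hypothetical, a two-sample t-statistic stands in for bwss(), the folds are not stratified, the labels are assumed to be coded 0/1, and the data are simulated.

```r
library(MASS)   # provides lda()

# Rank probesets (rows of x) by a two-sample t-statistic; a stand-in for bwss()
rank.features <- function(x, y) {
  a <- x[, y == 1, drop = FALSE]
  b <- x[, y == 0, drop = FALSE]
  tstat <- (rowMeans(a) - rowMeans(b)) /
    sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
  order(abs(tstat), decreasing = TRUE)
}

# Select the top ntop probesets on the training samples only, train an LDA
# on them and return the error rate on the test samples.
fit.and.test <- function(x, y, train, test, ntop) {
  sel <- rank.features(x[, train, drop = FALSE], y[train])[seq_len(ntop)]
  model <- lda(t(x[sel, train, drop = FALSE]), grouping = factor(y[train]))
  pred <- predict(model, t(x[sel, test, drop = FALSE]))$class
  mean(as.character(pred) != as.character(y[test]))
}

nested.cv <- function(x, y, candidates, k.outer = 5, k.inner = 3) {
  n <- length(y)
  outer <- sample(rep(seq_len(k.outer), length.out = n))
  res <- data.frame(fold = seq_len(k.outer), ntop = NA_real_, err = NA_real_)
  for (i in seq_len(k.outer)) {
    train <- which(outer != i)
    test <- which(outer == i)
    # the inner CV sees the outer training samples only and picks ntop
    inner <- sample(rep(seq_len(k.inner), length.out = length(train)))
    inner.err <- sapply(candidates, function(ntop)
      mean(sapply(seq_len(k.inner), function(j)
        fit.and.test(x, y, train[inner != j], train[inner == j], ntop))))
    best <- candidates[which.min(inner.err)]
    res$ntop[i] <- best
    # the outer test fold is used once, with the ntop chosen by the inner CV
    res$err[i] <- fit.and.test(x, y, train, test, best)
  }
  res
}

set.seed(7)
y <- rep(c(0, 1), each = 50)
x <- matrix(rnorm(200 * 100), nrow = 200)   # 200 toy probesets, 100 samples
x[1:10, y == 1] <- x[1:10, y == 1] + 1      # ten weakly informative probesets
res <- nested.cv(x, y, candidates = c(5, 10, 15, 20, 25))
res
mean(res$err)                               # nested CV estimate of the error rate
```

Because both the ranking of the probesets and the choice of their number happen inside the training part of each outer fold, the outer test samples never influence the model they are used to evaluate.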
Bibliography:
Books
- Hastie, Tibshirani, Friedman: The elements of statistical learning
- Webb: Statistical pattern recognition
- Duda, Hart, Stork: Pattern classification
- Bishop: Pattern recognition and machine learning
Some articles on error estimation
- T.Fawcett: ROC Graphs: Notes and Practical Considerations for Data Mining Researchers,
HP Tech. Report HPL-2003-4
- W.Jiang, S.Varma, R.Simon: Calculating Confidence Intervals for Prediction Errors
in Microarray Classification Using Resampling, Gene Expression
- J.K.Martin, D.S.Hirschberg: Small Sample Statistics for Classification Error Rates,
parts I and II