Lectures

WCMD 2008: Course: "Using DNA Microarrays for diagnosis and prognosis. The machine learning story"

Goals

The main goal of this course is to introduce the audience to a number of different techniques used in analyzing microarray data. The data analysis side is emphasized, but a brief overview of the technological basis is also given.

Slides

WCMD 2008: Tutorial session "Statistical data exploration and analysis with R"

Goals

This tutorial is designed to serve as an accelerated introduction to data analysis in R. It targets a technical audience with prior programming experience and at least a basic knowledge of applied statistics. The goal of the tutorial is to show how one can use R to perform a number of different basic data analyses and to serve as a starting point for further investigation. It will also help those attending the series of lectures on microarray data analysis.

Slides

Here are the slides. Those wishing to print the slides might prefer a different layout: 2 slides per page or 4 slides per page.

SIB/EMBnet: Statistical analysis applied to genomic and proteome analyses

Slides

Here are the slides from the presentation. If you want to print them out, you may find it more convenient to use either the 4 slides per page layout or the 2 slides per page layout, which leaves you some space for notes.

Exercise session (~120 mins.)

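  • Load the data (MDA dataset) and get familiar with it:
    # expression matrix: first column holds the row names, followed by 130 numeric columns
    X = as.matrix(read.table("expression.txt", header=TRUE,
        colClasses=c("character", rep("numeric",130)), row.names=1))
    # probe annotation (read.delim() already returns a data frame)
    A = read.delim("probe_annotation.txt")
    # clinical data: first column holds the row names, followed by 3 numeric columns
    Z = as.matrix(read.table("clinical.txt", header=TRUE,
        colClasses=c("character", rep("numeric",3)), row.names=1))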
    
  • Check the perf.r and kcv.r files and try to understand the parameters, the returned values and how CV should be used.
  • Check the fselect.r file for a simple implementation (taken from the SMA package) of a feature selection strategy.
  • Estimate the error rate, sensitivity, specificity and AUC for an LDA classifier of ER status, built using the top 3 probesets selected with bwss(). You can check the mda-ex01.r file. Experiment with different settings for the cross-validation. Plot the histogram of the error rates from repeated CVs. Compare the "textbook" confidence intervals with the empirical quantiles. Experiment with different numbers of probesets. Examples:
    r = do.training(Z[,"ER.status"], X, kfold=5, rep=1, ntop=3)
    r = do.training(Z[,"ER.status"], X, kfold=3, rep=10, ntop=3)
    boxplot(data.frame(Error=r$err, Sn=r$sens, Sp=r$spec, AUC=r$auc))
    quantile(r$err, probs=c(0.025,0.975))
          2.5%      97.5%
    0.04940476 0.07142857
    
  • [30 mins for a quiz session] Estimate the performance of a classifier trained to predict the pathologic complete response (see points 1-6 below). Please gather all the answers in a document and email it, at the end of the session, to vlad.popovici@isb-sib.ch. Do not forget to include your name in the document.
    1. Estimate the performance of LDA trained on the top 15 probesets, using repeated (20 times) stratified 5-fold cross-validation. (While waiting for the results, prepare the commands for points 3-4.)
    2. Report the mean and standard deviation of the error rate, AUC, sensitivity and specificity.
    3. Build a new training set X1 by selecting the top 15 probesets (using bwss()) from the full dataset. Hint: look in the mda-ex01.r file for the lines corresponding to feature selection and do the same on the full matrix X.
    4. Optimistically biased estimates of the performance: estimate the (biased) performance of an LDA trained on X1 to predict the pathologic complete response, using the same scheme as for point 1.
    5. Draw a boxplot of the correctly estimated error rate and AUC along with the biased estimates of the same quantities. Comment on the results.
    6. Train an LDA on the X1 data and write a function for drawing the ROC curve. Hint: use mda-ex01.r for an example of how to build the LDA model and how to predict the posterior probabilities.
  • (if time allows) Using the nested CV scheme, estimate the performance of an LDA with the optimal number of probesets selected from the {5,10,15,20,25} set. See the mda-ex02.r file and use it as a template. Finally, build the optimal classifier on the full dataset.
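
The reason the quiz separates "correct" from "optimistically biased" estimates can be illustrated on pure noise, where no classifier should do better than chance. The sketch below is not taken from the course files: bw_ratio(), nc_predict() and cv_error() are hypothetical helpers, bw_ratio() is only an assumed stand-in for the course's bwss(), a nearest-centroid rule replaces LDA to keep the example in base R, and the folds are plain random rather than stratified.

```r
set.seed(42)
n <- 60; p <- 1000
X <- matrix(rnorm(n * p), nrow = p)   # simulated noise: probesets in rows, samples in columns
y <- rep(c(0, 1), each = n / 2)       # arbitrary two-class labels, unrelated to X

# Ratio of between-group to within-group sum of squares, one value per row
# (assumed stand-in for the course's bwss(); not the original implementation).
bw_ratio <- function(X, y) {
  X0 <- X[, y == 0, drop = FALSE]; X1 <- X[, y == 1, drop = FALSE]
  m <- rowMeans(X); m0 <- rowMeans(X0); m1 <- rowMeans(X1)
  bss <- ncol(X0) * (m0 - m)^2 + ncol(X1) * (m1 - m)^2
  wss <- rowSums((X0 - m0)^2) + rowSums((X1 - m1)^2)
  bss / wss
}

# Nearest-centroid prediction (swapped in for LDA so that no package is needed).
nc_predict <- function(Xtr, ytr, Xte) {
  c0 <- rowMeans(Xtr[, ytr == 0, drop = FALSE])
  c1 <- rowMeans(Xtr[, ytr == 1, drop = FALSE])
  as.integer(colSums((Xte - c1)^2) < colSums((Xte - c0)^2))
}

# k-fold CV error; features are selected either inside each fold (honest)
# or once on the full data before CV (optimistically biased).
cv_error <- function(X, y, k = 5, ntop = 15, select_inside = TRUE) {
  folds <- sample(rep(seq_len(k), length.out = length(y)))
  if (!select_inside)
    keep <- order(bw_ratio(X, y), decreasing = TRUE)[seq_len(ntop)]
  wrong <- 0
  for (f in seq_len(k)) {
    te <- folds == f
    if (select_inside)
      keep <- order(bw_ratio(X[, !te, drop = FALSE], y[!te]),
                    decreasing = TRUE)[seq_len(ntop)]
    pred <- nc_predict(X[keep, !te, drop = FALSE], y[!te],
                       X[keep, te, drop = FALSE])
    wrong <- wrong + sum(pred != y[te])
  }
  wrong / length(y)
}

biased <- replicate(10, cv_error(X, y, select_inside = FALSE))
honest <- replicate(10, cv_error(X, y, select_inside = TRUE))
print(c(biased = mean(biased), honest = mean(honest)))
```

Since the labels carry no information about X, the honest estimate should hover around 0.5, while selecting the probesets on the full data before cross-validation typically yields a much lower, misleading error rate.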
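
For the ROC drawing task, one possible shape of such a function is sketched below. It is only an illustration under stated assumptions: roc_points(), auc_trapz() and draw_roc() are hypothetical names, and the scores are simulated here, whereas in the exercise they would be the posterior probabilities predicted by the LDA model.

```r
# Hypothetical helpers; the names are illustrative and not part of the course files.

# ROC points from scores (higher = more likely positive) and 0/1 labels.
# Each distinct score is used as a threshold, so tied scores are handled together.
roc_points <- function(score, label) {
  thr <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thr, function(t) mean(score[label == 1] >= t))  # sensitivity
  fpr <- sapply(thr, function(t) mean(score[label == 0] >= t))  # 1 - specificity
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))                  # start at (0, 0)
}

# Area under the ROC curve by the trapezoidal rule.
auc_trapz <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Draw the curve together with the chance diagonal.
draw_roc <- function(roc) {
  plot(roc$fpr, roc$tpr, type = "l", xlim = c(0, 1), ylim = c(0, 1),
       xlab = "1 - specificity", ylab = "sensitivity")
  abline(0, 1, lty = 2)
}

# Simulated scores standing in for LDA posterior probabilities (assumption).
set.seed(1)
label <- rep(c(0, 1), each = 50)
score <- plogis(rnorm(100, mean = ifelse(label == 1, 1, -1)))

roc <- roc_points(score, label)
auc <- auc_trapz(roc)
if (interactive()) draw_roc(roc)
```

Using the distinct scores as thresholds keeps the curve well defined when several samples receive the same posterior probability, and the trapezoidal area then counts such ties as half-correct.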

Bibliography:

Books

  • Hastie, Tibshirani, Friedman: The elements of statistical learning
  • Webb: Statistical pattern recognition
  • Duda, Hart, Stork: Pattern classification
  • Bishop: Pattern recognition and machine learning

Some articles on error estimation

  • T.Fawcett: ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, HP Tech. Report HPL-2003-4
  • W.Jiang, S.Varma, R.Simon: Calculating Confidence Intervals for Prediction Errors in Microarray Classification Using Resampling, Gene Expression
  • J.K.Martin, D.S.Hirschberg: Small Sample Statistics for Classification Error Rates, parts I and II
