BIOS/BCB 784

Fall 2024 – 1:25-2:40 MoWe – McGavran 2306

This course makes extensive use of R and assumes basic familiarity with base R (not packages) as a prerequisite. A self-quiz is available here, with answers provided here. You can also find a list of base R functions that one should be familiar with.

**For 2024 BCB students**: BCB 720 (background on statistical inference, i.e., working with the likelihood for parameter inference and conditional probabilities) is suitable as a prerequisite for BIOS/BCB 784.

For `Rmd` files, go to the course repo and navigate the directories or, best of all, clone the repo and navigate within RStudio.

| Week | Topic | Dir. | HW | HTML | Title |
|---|---|---|---|---|---|
| | Bio Intro / GitHub | `-` | | github | RStudio, git, and GitHub |
| | Simple EDA | `eda` | | EDA | Exploratory data analysis |
| | | | | NAs | Missing values in R |
| | | | | brain RNA | Exploring brain RNA |
| | Bioconductor I | `bioc` | | objects | Bioc data objects |
| | | | | ranges | Genomic ranges |
| | | | | GRL | GRangesList: lists of ranges |
| | Bioconductor II | | | anno | Accessing annotations |
| | | | | strings | Manipulating DNA strings |
| | Multiple testing | `test` | | multtest | FDR and Benjamini-Hochberg |
| | | | | localfdr | Local false discovery rate |
| | | | | IDR | Irreproducible discovery rate |
| | Distances & norm. I | `dist` | | distances | Distances in high dimensions |
| | | | | hclust | Hierarchical clustering |
| | Models and EM | `model` | | EM | Expectation maximization |
| | | | | motif | EM for finding DNA motifs |
| | ChIP-seq | | | | (Slides on Sakai) |
| | Motifs part II | | | | (In-class EM notes posted to GH) |
| | Distances & norm. II | `dist` | | batch | Batch effects and sources |
| | | | | sva | Surrogate variable analysis |
| | Batch effect solutions | | | | |
| | Hierarchical models | `hier` | | hierarchical | Hierarchical models |
| | | | | jamesstein | James-Stein estimator app |
| | Signal processing | `signal` | | hmm | Hidden Markov Models |
| | Tidy genomics | `-` | | tidy | Tidy ranges tutorial |
| | Network analysis | `net` | | network | Network analysis |

**What is the role of the computational biologist / statistician?**

- All biology is computational biology Florian Markowetz
- Questions, Answers and Statistics Deborah Nolan
- 50 Years of Data Science David Donoho
- The Future of Data Analysis John Tukey (this article, discussed by Donoho, is from 1962)
- Ten Simple Rules for Effective Statistical Practice Kass, Caffo, Davidian, Meng, Yu, and Reid
- Statistical Modeling: The Two Cultures Leo Breiman

**Exploratory data analysis**

**Bioconductor**

**Distances and normalization**

- Differential expression analysis for sequence count data Simon Anders and Wolfgang Huber
- Tackling the widespread and critical impact of batch effects in high-throughput data Leek et al
- Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis Jeffrey Leek and John Storey
- Normalization of RNA-seq data using factor analysis of control genes or samples Risso et al
- Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses Stegle et al

**Multiple testing**

- Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Yoav Benjamini and Yosef Hochberg
- A direct approach to false discovery rates John Storey
- Statistical significance for genomewide studies John Storey and Robert Tibshirani
- Large-scale simultaneous hypothesis testing Bradley Efron
- Empirical Bayes Analysis of a Microarray Experiment Efron et al
- Measuring reproducibility of high-throughput experiments Li et al

**Expectation maximization**

- What is the expectation maximization algorithm? Chuong B Do and Serafim Batzoglou
- Gaussian mixture models and the EM algorithm Ramesh Sridharan
- EM algorithm notes Andrew Ng
- MEME: discovering and analyzing DNA and protein sequence motifs Bailey et al

**Hierarchical models**

- Linear models and empirical Bayes methods for assessing differential expression in microarray experiments Gordon Smyth
- Analyzing ’omics data using hierarchical models Hongkai Ji and X Shirley Liu
- Stein’s Paradox in Statistics Bradley Efron and Carl Morris
- Stein’s estimation rule and its competitors - an empirical Bayes approach Bradley Efron and Carl Morris

**Signal processing**

- An Introduction to Hidden Markov Models Lawrence Rabiner and Biing-Hwang Juang
- Hidden Markov models approach to the analysis of array CGH data Fridlyand et al

**Network analysis**

- Online R Classes and Resources
- Rafael Irizarry and Michael Love, “Data Analysis for the Life Sciences” Free PDF, HTML
- Kasper Hansen, “Bioconductor for Genomic Data Science”
- Aaron Quinlan, “Applied Computational Genomics” (Slides)
- Jennifer Bryan et al, Stat 545
- Florian Markowetz, “You Are Not Working for Me; I Am Working with You”
- Tips to succeed in Computational Biology research

Some R resources

This is far from a complete list of topics in computational biology. The students taking the course are mostly graduate students in biostatistics, who have a statistical background but not much exposure to genomic or biological datasets. Classic computational biology topics, such as alignment algorithms or molecular dynamics, are not covered. Instead, the focus is on exploring genomic datasets and introducing the key statistical models that flourish in the high-throughput setting (normalization, false discovery rate calculation, EM algorithm, hierarchical models, HMM, etc.). The course also focuses on R/Bioconductor, as this is a familiar tool for most of the students and allows them to jump into the data analysis. The goal is that exposure to these topics and datasets will allow them to read the literature more effectively and pursue topics in biology and biomedical research.

This page was last updated on 06/03/2024.