Here I will introduce a popular statistical framework for working with the results of many parallel statistical tests, a setting that occurs frequently in genomic analyses. I won't go into depth here, but will discuss more in lecture. For background, you can also read the lecture notes from the HarvardX Biological Data Science series (the Wikipedia article on multiple comparisons is also a good introduction).

While up to this point we have minimized the use of simulated data to show concepts in Computational Biology, multiple testing is often introduced using simulated data, and there is a practical reason why. We can use real biological datasets to generate test statistics from the null hypothesis, by taking a large group of samples and randomly dividing it into two groups with no known biological difference. Unless we happen to choose a partition which corresponds to biological or technical factors, this is a pretty good way to observe null test statistics. We can limit the probability of such an accidental correspondence by choosing a large dataset, by balancing our random partition with respect to technical factors, or by removing technical factors using factor analysis methods such as SVA, RUV, or PEER.

However, it’s not easy to generate many test statistics from real data which combine known null and alternative hypotheses, because we often do not know which hypotheses are truly null, e.g. \(\Delta = 0\), and which are alternative, \(\Delta \ne 0\). So we often resort to simulation, or a combination of real null data with simulation used to create alternative hypotheses, to demonstrate the concepts of multiple test correction. Here we will use the latter, starting with a real dataset where we can create a null distribution of test statistics.

About the dataset

We start by downloading a large, normalized matrix of values for chromatin accessibility across 68 human donors. Chromatin accessibility refers to a quantitative measurement of whether the DNA at a certain location is packaged tightly or in a more open pattern, which indicates the degree to which proteins can access and bind to the DNA, in particular regulatory proteins which might affect transcription of genes. A diagram of inaccessible and accessible chromatin can be found on this Wikipedia page.

The dataset is hosted at this website, and we download a text file of normalized data contained in this subfolder, which is about 130 Mb, compressed. The publication associated with this dataset is listed at the above link, and the goal was to find genetic variants which are associated with differences in chromatin accessibility. We won’t discuss here exactly how the raw data was processed to produce the normalized matrix above, but for our purposes, all we need to know is that large values correspond to more accessible chromatin, each column corresponds to a different human donor, and each row corresponds to a distinct location in the genome. The rows are actually peaks of accessibility that were detected in at least some subset of the samples.

We have to use the skip=1 argument below because the first row contains only column names, and is missing a name for the first column (the index number of the peak from a larger set of peaks).

url <- "http://chromovar3d.stanford.edu/QTLs/correctedSignal/hMat.norm.ALL.dhs.peer_lS_5.txt.gz"
file <- "hMat.norm.ALL.dhs.peer_lS_5.txt.gz"
if (!file.exists(file)) download.file(url, file)
library(readr)
dnase <- read_delim(file, delim=" ", skip=1, col_names=FALSE)
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_integer()
## )
## See spec(...) for full column specifications.
dnase
## # A tibble: 250,000 x 69
##       X1          X2         X3         X4         X5          X6
##    <int>       <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
##  1     2 -2.07939782 -1.0919890  0.4358841  1.0771965  0.18202807
##  2     3  1.04012044 -0.6540832  0.5439511  0.5433546  0.33554692
##  3     5  0.61603730  1.6216130 -1.9550178  0.8038654 -1.17296933
##  4     6 -0.08366167  0.1037396 -0.7161806 -0.4290121 -1.05492685
##  5     7  0.71635444 -1.5309856  0.1680760 -1.3674195  0.18572840
##  6     8  0.21563757 -0.1195649 -0.4001214  0.1126866  1.67232404
##  7    16 -1.62173082 -1.9355604  1.5993482 -0.1015715  0.41055978
##  8    17  0.37951245 -1.4808371 -0.0526740 -0.9285295  0.25286624
##  9    18  1.70117468  0.3848198 -0.3243192  1.0260985  0.44060164
## 10    19 -1.94900586 -0.9384677 -1.2035325 -0.2577110  0.08515044
## # ... with 249,990 more rows, and 63 more variables: X7 <dbl>, X8 <dbl>,
## #   X9 <dbl>, X10 <dbl>, X11 <dbl>, X12 <dbl>, X13 <dbl>, X14 <dbl>,
## #   X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>,
## #   X21 <dbl>, X22 <dbl>, X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>,
## #   X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>,
## #   X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>, X37 <dbl>, X38 <dbl>,
## #   X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>, X43 <dbl>, X44 <dbl>,
## #   X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>, X49 <dbl>, X50 <dbl>,
## #   X51 <dbl>, X52 <dbl>, X53 <dbl>, X54 <dbl>, X55 <dbl>, X56 <dbl>,
## #   X57 <dbl>, X58 <dbl>, X59 <dbl>, X60 <dbl>, X61 <dbl>, X62 <dbl>,
## #   X63 <dbl>, X64 <dbl>, X65 <dbl>, X66 <dbl>, X67 <dbl>, X68 <dbl>,
## #   X69 <dbl>
dim(dnase)
## [1] 250000     69
col.names <- names(read.delim(file, nrows=1, sep=" "))
col.names
##  [1] "NA19160" "NA18861" "NA18862" "NA18909" "NA18907" "NA18517" "NA19119"
##  [8] "NA19116" "NA19114" "NA19171" "NA18870" "NA18486" "NA19206" "NA19207"
## [15] "NA19204" "NA19203" "NA19200" "NA19201" "NA19209" "NA19141" "NA19140"
## [22] "NA19143" "NA19144" "NA19147" "NA18498" "NA18499" "NA19099" "NA19098"
## [29] "NA19093" "NA19092" "NA18520" "NA18522" "NA19152" "NA19153" "NA19257"
## [36] "NA19159" "NA19239" "NA19238" "NA19127" "NA19225" "NA19222" "NA19223"
## [43] "NA19130" "NA19131" "NA18516" "NA18511" "NA18510" "NA19137" "NA19138"
## [50] "NA18519" "NA18507" "NA19190" "NA19210" "NA18858" "NA18859" "NA18913"
## [57] "NA18912" "NA18916" "NA18852" "NA18853" "NA18855" "NA18856" "NA18502"
## [64] "NA18501" "NA18504" "NA18505" "NA18508" "NA19101"

We have 250k peaks and 68 donors. We add the column names that we skipped before:

mat <- as.matrix(dnase[,-1])
dim(mat)
## [1] 250000     68
colnames(mat) <- col.names

I take a quick look at histograms of the values for the first four peaks:

par(mfrow=c(2,2))
for (i in 1:4) hist(mat[i,], col="grey")

Also, it’s always a good idea to make a PCA plot of the samples, to look for any large-scale structure in the data:

pc <- prcomp(t(mat))
plot(pc$x[,1:2])

If you square the standard deviation values and sum them, you get the total variance in the data. Here I show that the top PC accounts for only a small fraction of the total variance, which means there is no strong low-dimensional structure in the data, of the kind one might see with strong batch effects or distinct biological conditions.

plot(pc$sdev[1:10]^2 / sum(pc$sdev^2), type="b", ylab="Proportion of variance")
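
A quick way to double-check this is the summary method for prcomp objects, which reports the proportion of variance explained and the cumulative proportion for each component (shown here for the first five):

# proportion of variance explained and cumulative proportion for the first 5 PCs
summary(pc)$importance[, 1:5]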

Multiple testing

To investigate multiple testing, we first generate a mock comparison: I choose two random groups of samples, indicated by a factor variable called fake, and perform row-wise t-tests across all 250k peaks. I then look at the distribution of p-values across 20 bins defined on [0,1].

set.seed(1)
fake <- factor(sample(rep(1:2,each=ncol(mat)/2)))
library(genefilter)
ts <- rowttests(mat, fake)
nbins <- 20
brks <- 0:nbins/nbins
hist(ts$p.value, col="grey", breaks=brks)

Here I make this plot again, adding a line at the count we would expect on average in each bin if all hypotheses were null. We can see we are pretty close to the expected values for this set of bins.

p <- ts$p.value
hist(p, col="grey", breaks=brks)
abline(h=length(p)/nbins, col="dodgerblue", lwd=3)

1/nbins
## [1] 0.05
sum(p < 1/nbins)
## [1] 12687
length(p)/nbins
## [1] 12500

Now, because we want to explore how multiple testing frameworks deal with a mixture of null and alternative hypotheses, I will "spike in" 10,000 p-values corresponding to alternative hypotheses. I don't generate data for these, just p-values which, after a log10 transformation, are uniformly distributed on [-6,-1].

p2 <- p
p2[1:10000] <- 10^runif(10000,-6,-1)

Correction with p.adjust

I demonstrate two types of correction for multiple testing, the Bonferroni method and the Benjamini-Hochberg method, which are both implemented in the p.adjust function in R. The p.adjust function provides adjusted p-values, such that thresholding on the adjusted p-values delivers a set with certain statistical properties. As we went over in lecture, the Bonferroni method controls the family-wise error rate (FWER), while the Benjamini-Hochberg method controls the false discovery rate (FDR) in expectation. A reminder: these are critically different quantities. The FWER is the probability of making at least one false positive across all the tests, while the FDR is the expected proportion of false positives among the set of tests which are called positive. For example, if we reject 100 hypotheses and 5 of them are truly null, the realized false discovery proportion is 5%, regardless of how many tests were performed in total.

Below we show the histogram of adjusted p-values, and the minimal adjusted p-value. The Bonferroni adjustment is simply the original p-value multiplied by the number of tests (and capped at 1). So for this particular simulated dataset, we could reject a few hundred tests while controlling the FWER, but only at a rather high rate (e.g. 0.5, as shown below).

padj.bonf <- p.adjust(p2, method="bonferroni")
hist(padj.bonf, col="grey", breaks=brks)

min(padj.bonf)
## [1] 0.2503065
min(p2) * length(p2) == min(padj.bonf)
## [1] TRUE
sum(padj.bonf < .5)
## [1] 638
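
As a sanity check on the description above, here is a minimal sketch of the Bonferroni adjustment computed by hand; it should reproduce the values from p.adjust:

# Bonferroni by hand: multiply each p-value by the number of tests, cap at 1
padj.bonf.manual <- pmin(p2 * length(p2), 1)
all.equal(padj.bonf.manual, padj.bonf)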

The false discovery rate is a much more practical and desirable bound for genomic data analysis than the family-wise error rate. Investigators are often interested in the rate of false positives among the set of most promising features (genes, or in this case, genomic regions) that are identified by a statistical test.

Here, for example, we pick up thousands of tests while controlling the FDR at 5%.

padj.bh <- p.adjust(p2, method="BH")
hist(padj.bh, col="grey", breaks=brks)

min(padj.bh)
## [1] 0.0007430404
sum(padj.bh < .05)
## [1] 6585

One way to think about what our adjusted p-values are delivering is to examine a histogram of the original p-values for various bin sizes along [0,1]. We see an enrichment of small p-values in the smallest bin. If we draw a line at the count expected per bin if all the hypotheses were null, we see that more than half of the hypotheses in the first bin would be expected to be null. This roughly corresponds to the largest adjusted p-value in this bin, which makes sense: by thresholding on an adjusted p-value, we should obtain a set of hypotheses whose false discovery rate, in expectation, is bounded by that threshold.

nbins <- 20
brks <- 0:nbins/nbins
hist(p2, col="grey", breaks=brks)
abline(h=length(p)/nbins, col="dodgerblue", lwd=3)

max(padj.bh[p2 < 1/nbins])
## [1] 0.5801402
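
To make this correspondence concrete, we can divide the count expected from nulls in the first bin by the observed count there; under the reasoning above, this ratio should roughly agree with the largest BH adjusted p-value in the bin, printed above:

# expected null count in the first bin divided by the observed count there;
# this should be close to the maximum adjusted p-value in the bin
(length(p2)/nbins) / sum(p2 < 1/nbins)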

If we make the bins smaller, the expected null count per bin decreases relative to the height of the bar for the first bin, and again the maximum adjusted p-value in that bin roughly matches the expected proportion of nulls in it.

nbins <- 100
brks <- 0:nbins/nbins
hist(p2, col="grey", breaks=brks)
abline(h=length(p)/nbins, col="dodgerblue", lwd=3)

max(padj.bh[p2 < 1/nbins])
## [1] 0.2413422

Runs of identical adjusted p-values

Next, I want to show a property of adjusted p-values from the Benjamini-Hochberg (BH) method which sometimes surprises users who aren't familiar with the procedure. Note the runs of identical adjusted p-values among the first 2000 sorted p-values:

padj.sort <- sort(padj.bh)
plot(-log10(padj.sort[1:2000]), xlab="i", ylab="-log10 adjusted p")

plot(-log10(padj.sort[1:1200]), xlab="i", ylab="-log10 adjusted p")
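
To see these runs numerically rather than visually, we can tabulate the lengths of runs of identical values among the smallest sorted adjusted p-values (a quick check using base R's rle, not part of the BH procedure itself):

# lengths of the first few runs of identical adjusted p-values
head(rle(padj.sort[1:2000])$lengths)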

The publication of the BH method describes a procedure whereby one finds the largest i such that the i-th smallest p-value is less than or equal to \(\frac{i}{m} q\), where q is the desired FDR bound and m is the number of tests; all tests with the i smallest p-values are then rejected. The adjusted p-value for each test is the smallest q at which that test would be rejected, so it reverses the procedure, which starts from q. Geometrically, the BH procedure can be thought of as putting the tests in order of p-value along the x-axis, with the value of p on the y-axis. For the first 2000 tests, this looks like:

p.sort <- sort(p2)
plot(1:2000, p.sort[1:2000], ylim=c(0,p.sort[2000]), type="l", xlab="i", ylab="p")

We can then find the smallest q that defines a non-empty rejection set by drawing the line \(y(i) = \frac{q}{m} i\): a line through the origin whose slope increases with q. As the slope increases, the line first touches the sorted p-values at a single point, and all the p-values at or below that point are assigned the same adjusted p-value. They could only have a smaller adjusted p-value if the line had touched the sorted p-values earlier.

q <- min(padj.bh)
i <- sum(padj.bh == q)
m <- length(p2)
plot(1:2000, p.sort[1:2000], ylim=c(0,p.sort[2000]), type="l")
abline(0, q/m, col="red")
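
To connect the procedure back to the adjusted p-values, here is a minimal sketch of the step-up adjustment computed by hand; it should agree with the output of p.adjust with method="BH":

# BH adjustment by hand: sort p-values in decreasing order, scale by m over the rank,
# take a running minimum, cap at 1, then return to the original order
m <- length(p2)
o <- order(p2, decreasing=TRUE)
ro <- order(o)
padj.manual <- pmin(1, cummin(m/(m:1) * p2[o]))[ro]
all.equal(padj.manual, padj.bh)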

q-values

Finally, we mention that there is a method which improves upon the BH method by attempting to estimate the proportion of null hypotheses among all hypotheses, \(\pi_0 = \frac{m_0}{m}\), where \(m_0\) is the number of true nulls and \(m\) is the total number of tests. The BH method controls the FDR in expectation for any \(m_0 \le m\), but if we can come up with a good estimate of \(\pi_0\) when it is less than 1, we can reject more hypotheses and still control the FDR in expectation. Without getting into details, the qvalue function in the qvalue package performs this estimation and provides q-values, which operate analogously to the adjusted p-values we defined above: thresholding on q-values produces a set whose FDR should be bounded by that threshold in expectation. We can see that for this simulated data, the estimate of \(\pi_0\) is close to the truth:

library(qvalue)
res <- qvalue(p2)
res$pi0
## [1] 0.9512515
1 - 10000/length(p) # we spiked in 10k alternative hypotheses
## [1] 0.96
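
For intuition only, a crude version of this estimate, in the spirit of Storey's approach, counts the p-values above a cutoff lambda, where nearly all hypotheses should be null, and rescales by the width of that region. The choice lambda = 0.5 below is an arbitrary illustration, not the tuning procedure the qvalue package uses internally:

# crude pi0 estimate: fraction of p-values above lambda, rescaled by (1 - lambda)
lambda <- 0.5
sum(p2 > lambda) / (length(p2) * (1 - lambda))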

The q-values are the BH adjusted p-values rescaled multiplicatively by the estimate of \(\pi_0\), and so are smaller, although here only slightly so. For a smaller \(\pi_0\), the gain in sensitivity would be more substantial.

qval.sort <- sort(res$qvalues)
plot(padj.sort[1:2000], qval.sort[1:2000])
abline(0,1)
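
As a quick check of that relationship, with the qvalue defaults used here and up to numerical details, the q-values should simply be the BH adjusted p-values scaled by the \(\pi_0\) estimate:

# q-values should be (approximately) pi0 times the BH adjusted p-values
all.equal(res$qvalues, res$pi0 * padj.bh)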