Here we will explore a number of issues involved in comparing samples in high-throughput data, and in computing distances between samples. The main issues I want to introduce are:
An ideal dataset for exploring these issues is a large dataset with both technical and biological aspects recorded as metadata. It’s hard with few samples to discern what are technical differences, and what is simply biological variability among the samples in a condition. Note that proper data analysis requires that biological aspects of experiments be randomized or blocked with respect to technical aspects. Below we will show the number of significant differences we find when we simply compare measurements across the sequencing center in which the samples were processed.
Here we will work with the GEUVADIS project data, which contains 465 unique RNA-seq samples. The RNA was extracted from lymphoblastoid cell line samples from 5 populations of the 1000 Genomes Project: the CEPH (CEU), Finns (FIN), British (GBR), Toscani (TSI) and Yoruba (YRI). The CEPH population consists of Utah residents with Northern and Western European ancestry.
We start by downloading the summarized data from recount2 (this downloads ~75 Mb).
library(SummarizedExperiment)
url <- "http://duffel.rail.bio/recount/ERP001942/rse_gene.Rdata"
file <- "geuvadis_recount2.rda"
if (!file.exists(file)) download.file(url, file)
load(file)
library(here)
source(here("bioc","my_scale_counts.R"))
rse <- my_scale_counts(rse_gene)
Now we read the GEUVADIS phenotypic data from a TSV file. This
contains additional data not present in colData
of the
SummarizedExperiment.
library(readr)
##
## Attaching package: 'readr'
## The following objects are masked from 'package:testthat':
##
## edition_get, local_edition
pheno <- read_tsv(here("dist","geuvadis.tsv"))
## Rows: 667 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (19): BioSample_s, Center_Name_s, Experiment_s, Library_Name_s, Run_s, SRA_Sample_s, Sample...
## dbl (4): AvgSpotLen_l, MBases_l, MBytes_l, InsertSize_l
## date (2): LoadDate_s, ReleaseDate_s
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df <- data.frame(pheno[,c("Run_s","Center_Name_s","population_s")])
df <- data.frame(lapply(df, factor))
colnames(df) <- c("run","center","population")
levels(df$center)
## [1] "Centre for Genomic Regulation, Barcelona"
## [2] "Centro Nacional de Análisis Genómico"
## [3] "Helmholtz Zentrum München, German Research Center for Environmental Health, Germany"
## [4] "Institute of Clinical Molecular Biology, University of Kiel, Germany"
## [5] "Leiden University Medical Center"
## [6] "Max Planck Institute for Molecular Genetics"
## [7] "University of Geneva"
## [8] "Uppsala University"
levels(df$center) <- c("CGR","CNAG","HZM","ICMB","LUMC","MPIMG","UG","UU")
head(df)
## run center population
## 1 ERR188398 MPIMG TSI
## 2 ERR188110 MPIMG TSI
## 3 ERR188253 ICMB TSI
## 4 ERR188407 MPIMG GBR
## 5 ERR188396 UG CEU
## 6 ERR204857 UG FIN
We have more rows in the TSV than samples in rse
. And
there appear to be technical replicates, as there are 465 samples in
GEUVADIS according to the project website.
nrow(df)
## [1] 667
ncol(rse)
## [1] 663
Here I make sure that all the samples in rse
are present
in df
, and then match up the data in df
and
add it to the colData
of rse
:
all(colnames(rse) %in% df$run)
## [1] TRUE
m <- match(colnames(rse), df$run)
rse$center <- df$center[m]
rse$population <- df$population[m]
The total number of counts per sample is a technical artifact, which we need to control for before computing distances. Let’s compute the sum of each column, and then divide by 1 million (we can call this “column sum per million”). This has a large range, from the sample with the least counts to the sample with the most counts.
The number of counts (either reads or pairs of reads which are referred to as fragments) is known as the sample’s sequencing depth.
cspm <- colSums(assay(rse))/1e6
range(cspm)
## [1] 8.550355 77.576739
Here I pick two samples in the dataset, with low and with high sequencing depth:
idx <- order(cspm)[c(100,500)]
cspm[idx]
## ERR205017 ERR188420
## 15.85876 31.59642
Let’s compare the counts for these two samples by making a simple scatter plot:
plot(assay(rse)[,idx])
Clearly, the log is useful here, as all the points are clustered near
the origin. It’s also useful to make the points somewhat transparent
using rgb
so we can see how they overlay. Finally we add a
line that indicates y=x.
plot(log10(assay(rse)[,idx] + 1),
cex=.5, col=rgb(0,0,0,.1), pch=20)
abline(0,1,col="red", lwd=3)
The problem with the scatterplot above is that, we really care about the differences between these two samples, but those differences are on a 45 degree angle. A common plot in genomics is to put the difference itself on the y-axis (typical the difference in log-scale measurements, so the log fold change). The x-axis can then be the average between the two values, whether on the original or log scale. This is called an “MA-plot” for minus over average, or a Bland-Altman plot. Note that we can now directly see the extent of the differences between these samples. And we can easily see the systematic technical factor associated with higher sequencing depth in the second sample.
x <- log2(assay(rse)[,idx[1]] + 1)
y <- log2(assay(rse)[,idx[2]] + 1)
plot(.5*(x + y), y - x,
cex=.5, col=rgb(0,0,0,.1), pch=20)
abline(h=0, col="red", lwd=3)
Remember, if we were to calculate Euclidean distances between samples on the log scale, we would be calculating the differences drawn below, squaring and summing them. So we definitely want to center these differences on the x-axis here, as we know the systematic difference is due solely to sequencing depth.
While we’re making this plot, also note the inflation of differences on the left side of the plot. This increase in variance depending on signal strength – called heteroskedasticity in statistics – is a technical artifact from the inherent count-based measurements of high-throughput sequencing data, and needs to be dealt with to compute biologically meaningful distances between samples.
plot(.5*(x + y), y - x, type="h")
As this is a graduate course in biostatistics, I’m assuming some familiarity with principle component analysis. PCA is an invaluable technique for high-throughput data, to capture in a 2D plot the inter-relationships between samples in the dataset. Note that multidimensional scaling (MDS) on a Euclidean distance matrix produces an equivalent plot as PCA.
Performing PCA on unscaled counts will allow us to see the extent to which sequencing depth contributes to total variance (of log-transformed counts). Computing PCA on such a large matrix ~58,000 x ~660 takes a long time, so we subset to the top 20,000 genes by unsupervised variance (taking the variance of each row of the matrix, ignoring the technical and biological covariates).
We can notice two things from plotting PC1 over the column sums we previously calculated: There is a single sample which is an outlier in PC1, and otherwise, PC1 is capturing the differences between samples in sequencing depth.
library(matrixStats)
log.cts <- log(assay(rse) + 1)
rv <- rowVars(log.cts)
o <- order(rv,decreasing=TRUE)
system.time({
pc <- prcomp(t(log.cts[head(o,20000),]))
}) # takes ~1 min
## user system elapsed
## 19.819 0.787 21.472
plot(cspm, pc$x[,1])
outlier <- which(pc$x[,1] > 300)
points(cspm[outlier], pc$x[outlier,1], col="blue", cex=3)
By plotting 20 “normal” samples and the outlier sample identified above, we see that this sample has a very different distribution of log counts. Remember from the above plot, the outlier does not have the highest sequencing depth (as measured by total count).
idx <- order(cspm)[round(seq(from=1,to=length(cspm),length=20))]
boxplot(log.cts[,c(idx,outlier)], range=0)
Here, we recompute the PCA, removing the outlier. Where we can clearly see that PC1 is capturing sequencing depth, so we definitely want to remove this technical effect.
system.time({
pc <- prcomp(t(log.cts[head(o,20000),-outlier]))
}) # takes ~1 min
## user system elapsed
## 19.584 0.651 20.841
plot(cspm[-outlier], pc$x[,1])
We will discuss in class the reasons that the column sum is a bad estimator for a sequencing depth correction. Briefly, it is because the column sum is overly influenced by the genes with the very largest counts, whereas we are interested in determining the shift on the log scale that occurs across the entire range of counts. Most methods for computing sequencing depth, or equivalently, library size correction factors involve a robust estimator of the center of the log ratios. Two of the most used methods are the median of ratios, used by DESeq and DESeq2 and the trimmed mean of log ratios, used by edgeR. Here we will show how the median ratio method works for RNA-seq counts. (Note that these methods for finding technical scaling factors may work well for some sequencing datasets, such as bulk RNA-seq, but not with others, such as metagenomic data or single-cell RNA-seq.
We start by constructing a pseudo-reference sample, which
will be the geometric mean for each row. If a single sample has a 0,
this makes the geometric mean 0, and so this row will not be used for
calculating the correction factor. We can quickly compute the geometric
mean of each row using log
, then rowMean
then
exp
(this is equivalent to the n-th root of the product of
n terms).
Here I plot a single sample (the one with the highest column sum), against this pseudo-reference sample, using the MA plot code I showed before. The log ratios are shifted above the x-axis as expected.
reference <- exp(rowMeans(log(assay(rse))))
x <- log2(reference)
y <- log2(assay(rse)[,which.max(cspm)] + 1)
plot(.5*(x + y), y - x,
cex=.5, col=rgb(0,0,0,.2), pch=20)
abline(h=0, col="red", lwd=3)
Now we plot a histogram of the log ratios, and identify the median with a vertical line (excluding those rows in which the geometric mean was 0, and so the log was undefined).
hist(y - x, breaks=200, col="grey", xlim=c(-2,5))
median.lfc <- median((y - x)[is.finite(x)])
abline(v = median.lfc, col="blue", lwd=3)
Now we draw the MA plot again, with the line through the median log ratio calculated just now. The essential idea in the median ratio method is to scale the raw counts down by this factor – if we are modeling with a GLM, to include this as an offset in the model. Note that, it is not necessary that all points fall on the blue line. We expect deviations representing within-condition biological variation as well as differential expression across conditions. We simply want to remove the systematic shift, which we know is a technical artifact – if there was a biological systematic shift in expression levels for a sample, the experimenter needs to perform additional controls or to identify “housekeeping” genes which they believe do not change systematically. Finally, we note that we could have applied something like the quantile normalization we showed earlier in the exploration of microarray data. The scaling method is typically preferred for count data, as we maintain information about the precision of low vs high counts.
plot(.5*(x + y), y - x,
cex=.5, col=rgb(0,0,0,.2), pch=20)
abline(h=0, col="red", lwd=3)
abline(h=median.lfc, col="blue", lwd=3)
Instead of doing all this computation manually, we can use a
Bioconductor package DESeq2 to do it for us. We will explore
more sophisticated transformations in the next section, but for now we
will just divide each column by a scaling factor, add 1 and take the
logarithm (base 2). The plotPCA
automatically just looks at
the top genes by their unsupervised variance. We clearly see two groups
separating, and not by the center which prepared the sample.
Note: here, and in other functions, you may see the word normalization, which really just refers to scaling to remove the influence of the sequencing depth. I’ve tried to stop using the word normalization in this context as I’ve found it confuses collaborators. For example, the data is not normally distributed after scaling to adjust for sequencing depth, it is still right skewed. After appropriate transformation, the data are more symmetric (and we will discuss the topic of transformations more in the next lecture).
library(DESeq2)
dds <- DESeqDataSet(rse, ~1)
## converting counts to integer mode
ntd <- normTransform(dds)
plotPCA(ntd, c("center"))
If we plot by the human population of the RNA-seq sample donors, we see this explains PC1, but not PC2, where we see the clear separation.
plotPCA(ntd, c("population"))
A reasonable guess is that PC2 represents the differences between the sexes of the donors. We don’t have this information in the metadata, but we can make a rough guess of sex by summing up the counts of genes expressed on the Y chromosome.
Note: this is just a rough guess of the sex of the donors, for demonstration of exploring this data. You wouldn’t use such a method for a final analysis. Another important note is that the Y chromosome gene expression has nothing to do with the gender or self-identification of the donors, and that binarizing a variable like sex is an oversimplification. When talking about the sex chromosomes, or biological sex, it is important to use the correct term “sex” and not “gender”.
We find out which rowRanges
of the dataset are on
chrY
, and then compute the sum of scaled counts from those
genes:
chrY <- as.logical(seqnames(rowRanges(ntd)) == "chrY")
dds <- estimateSizeFactors(dds)
chrYexprs <- colSums(counts(dds, normalized=TRUE)[chrY,])
hist(chrYexprs, breaks=100, col="grey")
It’s a bit more clear on the log scale that there are two clusters:
hist(log10(chrYexprs), breaks=100, col="grey")
We make an estimate of sex by setting an arbitrary cut-point, and we see that this roughly divides the groups of samples we saw as PC2:
ntd$male <- log10(chrYexprs) > 3.5
plotPCA(ntd, intgroup="male")
We add our estimate of sex to the colData
of the
dataset, and save this object, so we can continue working with this data
later. The saved version should be about 40 Mb.
dds$male <- log10(chrYexprs) > 3.5
table(dds$male)
##
## FALSE TRUE
## 331 332
save(dds, file="geuvadis.rda")
Let’s subset to just the female samples, and recalculate a PCA. We don’t see much clustering by the sequencing center overall:
dds.f <- dds[,!dds$male]
ntd.f <- normTransform(dds.f)
plotPCA(ntd.f, c("center"))
But we can see clear clustering by the human populations:
plotPCA(ntd.f, c("population"))
Finally, I note that, although we didn’t see clustering by sequencing center in the top two PCs above, there is still some technical variation present in the scaled counts, associated with sequencing center. For example, we can run F-tests on each row to find association of log scaled counts with center. We then plot the log scaled counts for the gene with the smallest p-value (uncorrected).
library(genefilter)
tests <- rowFtests(assay(ntd.f), ntd.f$center)
top.gene.log.exprs <- assay(ntd.f)[which.min(tests$p.value),]
plot(top.gene.log.exprs ~ ntd.f$center)
You can see from a histogram of p-values over all genes, that there is an enrichment of small p-values for certain genes. We will see in a later section why the measurements for these genes show an association with sequencing center.
hist(tests$p.value, col="grey")
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] genefilter_1.79.0 DESeq2_1.37.4 readr_2.1.2
## [4] here_1.0.1 SummarizedExperiment_1.27.1 Biobase_2.57.1
## [7] GenomicRanges_1.49.0 GenomeInfoDb_1.33.3 IRanges_2.31.0
## [10] S4Vectors_0.35.1 BiocGenerics_0.43.0 MatrixGenerics_1.9.1
## [13] matrixStats_0.62.0 pkgdown_2.0.5 testthat_3.1.4
## [16] rmarkdown_2.14 devtools_2.4.3 usethis_2.1.6
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 fs_1.5.2 bit64_4.0.5 RColorBrewer_1.1-3
## [5] httr_1.4.3 rprojroot_2.0.3 tools_4.2.1 bslib_0.3.1
## [9] utf8_1.2.2 R6_2.5.1 colorspace_2.0-3 DBI_1.1.3
## [13] tidyselect_1.1.2 prettyunits_1.1.1 processx_3.6.1 bit_4.0.4
## [17] compiler_4.2.1 cli_3.3.0 DelayedArray_0.23.0 labeling_0.4.2
## [21] sass_0.4.1 scales_1.2.0 callr_3.7.0 stringr_1.4.0
## [25] digest_0.6.29 XVector_0.37.0 pkgconfig_2.0.3 htmltools_0.5.2
## [29] sessioninfo_1.2.2 fastmap_1.1.0 highr_0.9 rlang_1.0.3
## [33] rstudioapi_0.13 RSQLite_2.2.14 farver_2.1.1 generics_0.1.3
## [37] jquerylib_0.1.4 jsonlite_1.8.0 BiocParallel_1.31.10 vroom_1.5.7
## [41] dplyr_1.0.9 RCurl_1.98-1.7 magrittr_2.0.3 GenomeInfoDbData_1.2.8
## [45] Matrix_1.5-0 munsell_0.5.0 Rcpp_1.0.8.3 fansi_1.0.3
## [49] lifecycle_1.0.1 stringi_1.7.6 yaml_2.3.5 zlibbioc_1.43.0
## [53] brio_1.1.3 pkgbuild_1.3.1 grid_4.2.1 blob_1.2.3
## [57] parallel_4.2.1 crayon_1.5.1 lattice_0.20-45 Biostrings_2.65.1
## [61] splines_4.2.1 annotate_1.75.0 hms_1.1.1 KEGGREST_1.37.2
## [65] locfit_1.5-9.5 knitr_1.39 ps_1.7.1 pillar_1.7.0
## [69] geneplotter_1.75.0 codetools_0.2-18 pkgload_1.3.0 XML_3.99-0.10
## [73] glue_1.6.2 evaluate_0.15 remotes_2.4.2 vctrs_0.4.1
## [77] png_0.1-7 tzdb_0.3.0 gtable_0.3.0 purrr_0.3.4
## [81] assertthat_0.2.1 ggplot2_3.3.6 cachem_1.0.6 xfun_0.31
## [85] xtable_1.8-4 survival_3.3-1 tibble_3.1.7 AnnotationDbi_1.59.1
## [89] memoise_2.0.1 ellipsis_0.3.2