Computing distances for high-throughput data

Author

Here we will explore a number of issues involved in comparing samples in high-throughput data, and in computing distances between samples. The main issues I want to introduce are:

An ideal dataset for exploring these issues is a large dataset with both technical and biological aspects recorded as metadata. It’s hard with few samples to discern what are technical differences, and what is simply biological variability among the samples in a condition. Note that proper data analysis requires that biological aspects of experiments be randomized or blocked with respect to technical aspects. Below we will show the number of significant differences we find when we simply compare measurements across the sequencing center in which the samples were processed.

Here we will work with the GEUVADIS project data, which contains 465 unique RNA-seq samples. The RNA was extracted from lymphoblastoid cell line samples from 5 populations of the 1000 Genomes Project: the CEPH (CEU), Finns (FIN), British (GBR), Toscani (TSI) and Yoruba (YRI). The CEPH population consists of Utah residents with Northern and Western European ancestry.

We start by downloading the summarized data from recount2 (this downloads ~75 Mb).

library(here)
library(SummarizedExperiment)
url <- "http://duffel.rail.bio/recount/ERP001942/rse_gene.Rdata"
file <- here("bioc","geuvadis_recount2.rda")
if (!file.exists(file)) download.file(url, file)
load(file)
source(here("bioc","my_scale_counts.R"))
rse <- my_scale_counts(rse_gene)

Now we read the GEUVADIS phenotypic data from a TSV file. This contains additional data not present in colData of the SummarizedExperiment.

library(readr)

Attaching package: 'readr'
The following objects are masked from 'package:testthat':

    edition_get, local_edition
pheno <- read_tsv(here("dist","geuvadis.tsv"))
Rows: 667 Columns: 25
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (19): BioSample_s, Center_Name_s, Experiment_s, Library_Name_s, Run_s, SRA_Sample_s, Sample...
dbl   (4): AvgSpotLen_l, MBases_l, MBytes_l, InsertSize_l
date  (2): LoadDate_s, ReleaseDate_s

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df <- data.frame(pheno[,c("Run_s","Center_Name_s","population_s")])
df <- data.frame(lapply(df, factor))
colnames(df) <- c("run","center","population")
levels(df$center)
[1] "Centre for Genomic Regulation, Barcelona"                                           
[2] "Centro Nacional de Análisis Genómico"                                               
[3] "Helmholtz Zentrum München, German Research Center for Environmental Health, Germany"
[4] "Institute of Clinical Molecular Biology, University of Kiel, Germany"               
[5] "Leiden University Medical Center"                                                   
[6] "Max Planck Institute for Molecular Genetics"                                        
[7] "University of Geneva"                                                               
[8] "Uppsala University"                                                                 
levels(df$center) <- c("CGR","CNAG","HZM","ICMB","LUMC","MPIMG","UG","UU")
head(df)
        run center population
1 ERR188398  MPIMG        TSI
2 ERR188110  MPIMG        TSI
3 ERR188253   ICMB        TSI
4 ERR188407  MPIMG        GBR
5 ERR188396     UG        CEU
6 ERR204857     UG        FIN

We have more rows in the TSV than samples in rse. And there appear to be technical replicates, as there are 465 samples in GEUVADIS according to the project website.

nrow(df)
[1] 667
ncol(rse)
[1] 663

Here I make sure that all the samples in rse are present in df, and then match up the data in df and add it to the colData of rse:

all(colnames(rse) %in% df$run)
[1] TRUE
m <- match(colnames(rse), df$run)
rse$center <- df$center[m]
rse$population <- df$population[m]

Sequencing depth as a technical artifact

The total number of counts per sample is a technical artifact, which we need to control for before computing distances. Let’s compute the sum of each column, and then divide by 1 million (we can call this “column sum per million”). This has a large range, from the sample with the least counts to the sample with the most counts.

The number of counts (either reads or pairs of reads which are referred to as fragments) is known as the sample’s sequencing depth.

cspm <- colSums(assay(rse))/1e6
range(cspm)
[1]  8.550355 77.576739

Here I pick two samples in the dataset, with low and with high sequencing depth:

idx <- order(cspm)[c(100,500)]
cspm[idx]
ERR205017 ERR188420 
 15.85876  31.59642 

Let’s compare the counts for these two samples by making a simple scatter plot:

plot(assay(rse)[,idx])

Clearly, the log is useful here, as all the points are clustered near the origin. It’s also useful to make the points somewhat transparent using rgb so we can see how they overlay. Finally we add a line that indicates y=x.

plot(log10(assay(rse)[,idx] + 1), 
     cex=.5, col=rgb(0,0,0,.1), pch=20)
abline(0,1,col="red", lwd=3)

MA plot

The problem with the scatterplot above is that, we really care about the differences between these two samples, but those differences are on a 45 degree angle. A common plot in genomics is to put the difference itself on the y-axis (typical the difference in log-scale measurements, so the log fold change). The x-axis can then be the average between the two values, whether on the original or log scale. This is called an “MA-plot” for minus over average, or a Bland-Altman plot. Note that we can now directly see the extent of the differences between these samples. And we can easily see the systematic technical factor associated with higher sequencing depth in the second sample.

x <- log2(assay(rse)[,idx[1]] + 1)
y <- log2(assay(rse)[,idx[2]] + 1)
plot(.5*(x + y), y - x,
     cex=.5, col=rgb(0,0,0,.1), pch=20)
abline(h=0, col="red", lwd=3)

Remember, if we were to calculate Euclidean distances between samples on the log scale, we would be calculating the differences drawn below, squaring and summing them. So we definitely want to center these differences on the x-axis here, as we know the systematic difference is due solely to sequencing depth.

While we’re making this plot, also note the inflation of differences on the left side of the plot. This increase in variance depending on signal strength – called heteroskedasticity in statistics – is a technical artifact from the inherent count-based measurements of high-throughput sequencing data, and needs to be dealt with to compute biologically meaningful distances between samples.

plot(.5*(x + y), y - x, type="h")

Principle component analysis

As this is a graduate course in biostatistics, I’m assuming some familiarity with principle component analysis. PCA is an invaluable technique for high-throughput data, to capture in a 2D plot the inter-relationships between samples in the dataset. Note that multidimensional scaling (MDS) on a Euclidean distance matrix produces an equivalent plot as PCA.

Performing PCA on unscaled counts will allow us to see the extent to which sequencing depth contributes to total variance (of log-transformed counts). Computing PCA on such a large matrix ~58,000 x ~660 takes a long time, so we subset to the top 20,000 genes by unsupervised variance (taking the variance of each row of the matrix, ignoring the technical and biological covariates).

We can notice two things from plotting PC1 over the column sums we previously calculated: There is a single sample which is an outlier in PC1, and otherwise, PC1 is capturing the differences between samples in sequencing depth.

library(matrixStats)
log.cts <- log(assay(rse) + 1)
rv <- rowVars(log.cts)
o <- order(rv,decreasing=TRUE)
system.time({
  pc <- prcomp(t(log.cts[head(o, 10000),]))
}) # takes ~10 seconds
   user  system elapsed 
  7.656   0.141   7.921 
plot(cspm, pc$x[,1])
outlier <- which(pc$x[,1] > 200)
points(cspm[outlier], pc$x[outlier,1], col="blue", cex=3)

By plotting 20 “normal” samples and the outlier sample identified above, we see that this sample has a very different distribution of log counts. Remember from the above plot, the outlier does not have the highest sequencing depth (as measured by total count).

idx <- order(cspm)[round(seq(from=1,to=length(cspm),length=20))]
boxplot(log.cts[,c(idx,outlier)], range=0)

Here, we recompute the PCA, removing the outlier. Where we can clearly see that PC1 is capturing sequencing depth, so we definitely want to remove this technical effect.

system.time({
  pc <- prcomp(t(log.cts[head(o, 10000),-outlier]))
}) # takes ~10 seconds
   user  system elapsed 
  7.686   0.138   8.032 
plot(cspm[-outlier], pc$x[,1])

Scaling for sequencing depth

We will discuss in class the reasons that the column sum is a bad estimator for a sequencing depth correction. Briefly, it is because the column sum is overly influenced by the genes with the very largest counts, whereas we are interested in determining the shift on the log scale that occurs across the entire range of counts. Most methods for computing sequencing depth, or equivalently, library size correction factors involve a robust estimator of the center of the log ratios. Two of the most used methods are the median of ratios, used by DESeq and DESeq2 and the trimmed mean of log ratios, used by edgeR. Here we will show how the median ratio method works for RNA-seq counts. (Note that these methods for finding technical scaling factors may work well for some sequencing datasets, such as bulk RNA-seq, but not with others, such as metagenomic data or single-cell RNA-seq.

We start by constructing a pseudo-reference sample, which will be the geometric mean for each row. If a single sample has a 0, this makes the geometric mean 0, and so this row will not be used for calculating the correction factor. We can quickly compute the geometric mean of each row using log, then rowMean then exp (this is equivalent to the n-th root of the product of n terms).

Here I plot a single sample (the one with the highest column sum), against this pseudo-reference sample, using the MA plot code I showed before. The log ratios are shifted above the x-axis as expected.

reference <- exp(rowMeans(log(assay(rse))))
x <- log2(reference)
y <- log2(assay(rse)[,which.max(cspm)] + 1)
plot(.5*(x + y), y - x,
     cex=.5, col=rgb(0,0,0,.2), pch=20)
abline(h=0, col="red", lwd=3)

Now we plot a histogram of the log ratios, and identify the median with a vertical line (excluding those rows in which the geometric mean was 0, and so the log was undefined).

hist(y - x, breaks=200, col="grey", xlim=c(-2,5))
median.lfc <- median((y - x)[is.finite(x)])
abline(v = median.lfc, col="blue", lwd=3)

Now we draw the MA plot again, with the line through the median log ratio calculated just now. The essential idea in the median ratio method is to scale the raw counts down by this factor – if we are modeling with a GLM, to include this as an offset in the model. Note that, it is not necessary that all points fall on the blue line. We expect deviations representing within-condition biological variation as well as differential expression across conditions. We simply want to remove the systematic shift, which we know is a technical artifact – if there was a biological systematic shift in expression levels for a sample, the experimenter needs to perform additional controls or to identify “housekeeping” genes which they believe do not change systematically. Finally, we note that we could have applied something like the quantile normalization we showed earlier in the exploration of microarray data. The scaling method is typically preferred for count data, as we maintain information about the precision of low vs high counts.

plot(.5*(x + y), y - x,
     cex=.5, col=rgb(0,0,0,.2), pch=20)
abline(h=0, col="red", lwd=3)
abline(h=median.lfc, col="blue", lwd=3)

PCA after scaling

Instead of doing all this computation manually, we can use a Bioconductor package DESeq2 to do it for us. We will explore more sophisticated transformations in the next section, but for now we will just divide each column by a scaling factor, add 1 and take the logarithm (base 2). The plotPCA automatically just looks at the top genes by their unsupervised variance. We clearly see two groups separating, and not by the center which prepared the sample.

Note: here, and in other functions, you may see the word normalization, which really just refers to scaling to remove the influence of the sequencing depth. I’ve tried to stop using the word normalization in this context as I’ve found it confuses collaborators. For example, the data is not normally distributed after scaling to adjust for sequencing depth, it is still right skewed. After appropriate transformation, the data are more symmetric (and we will discuss the topic of transformations more in the next lecture).

library(DESeq2)
dds <- DESeqDataSet(rse, ~1)
converting counts to integer mode
dds <- estimateSizeFactors(dds)
ntd <- normTransform(dds)
plotPCA(ntd, c("center"), ntop=2000)
using ntop=2000 top features by variance

Not much variance explained in the first two PCs by center, what about the human population of the donor:

plotPCA(ntd, c("population"), ntop=2000)
using ntop=2000 top features by variance

So it seems there is a good amount of biological variance due to the human population of the donor in the top 2,000 features ranked by variance.

We can recompute this manually (the same code is run inside of plotPCA):

rv <- rowVars(assay(ntd))
o <- order(rv, decreasing=TRUE)
system.time({
  pc <- prcomp(t(assay(ntd)[head(o, 2000),]))
}) # takes ~2 seconds
   user  system elapsed 
  1.903   0.029   1.951 
plot(pc$x[,1:2], col=as.numeric(ntd$population) + 1, pch=20)

Let’s examine the distribution of the PCs:

library(tidyr)
library(dplyr)
pc_long <- pc$x[,1:9] |>
  as.data.frame() |>
  pivot_longer(cols = everything(), 
               names_to = "PC", 
               values_to = "score")
library(ggplot2)
ggplot(pc_long, aes(x = score)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ PC)

PC3 seems to be related to expression of genes on sex chromosomes:

chrY <- as.logical(seqnames(rowRanges(dds)) == "chrY")
chrYexprs <- colSums(counts(dds, normalized=TRUE)[chrY,])
plot(pc$x[,3], chrYexprs)

We save the dataset for use in later documents.

save(dds, file=here("bioc","geuvadis.rda"))

Although we didn’t see clustering by sequencing center in the top two PCs above, there is still some technical variation present in the scaled counts, associated with sequencing center. For example, we can run F-tests on each row to find association of log scaled counts with center. We then plot the log scaled counts for the gene with the smallest p-value (uncorrected).

library(genefilter)
tests <- rowFtests(assay(ntd), ntd$center)
top.gene.log.exprs <- assay(ntd)[which.min(tests$p.value),]
plot(top.gene.log.exprs ~ ntd$center)

You can see from a histogram of p-values over all genes, that there is an enrichment of small p-values for certain genes. We will see in a later section why the measurements for these genes show an association with sequencing center.

hist(tests$p.value, col="grey")

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] genefilter_1.86.0           ggplot2_3.5.1               dplyr_1.1.4                
 [4] tidyr_1.3.1                 DESeq2_1.44.0               readr_2.1.5                
 [7] SummarizedExperiment_1.34.0 Biobase_2.64.0              GenomicRanges_1.56.1       
[10] GenomeInfoDb_1.40.1         IRanges_2.38.1              S4Vectors_0.42.1           
[13] BiocGenerics_0.50.0         MatrixGenerics_1.16.0       matrixStats_1.3.0          
[16] here_1.0.1                  testthat_3.2.1.1            rmarkdown_2.27             
[19] devtools_2.4.5              usethis_3.0.0              

loaded via a namespace (and not attached):
 [1] DBI_1.2.3               remotes_2.5.0           rlang_1.1.4             magrittr_2.0.3         
 [5] RSQLite_2.3.7           compiler_4.4.1          png_0.1-8               vctrs_0.6.5            
 [9] stringr_1.5.1           profvis_0.3.8           pkgconfig_2.0.3         crayon_1.5.3           
[13] fastmap_1.2.0           XVector_0.44.0          ellipsis_0.3.2          labeling_0.4.3         
[17] utf8_1.2.4              promises_1.3.0          sessioninfo_1.2.2       tzdb_0.4.0             
[21] UCSC.utils_1.0.0        purrr_1.0.2             bit_4.0.5               xfun_0.46              
[25] zlibbioc_1.50.0         cachem_1.1.0            jsonlite_1.8.8          blob_1.2.4             
[29] later_1.3.2             DelayedArray_0.30.1     BiocParallel_1.38.0     parallel_4.4.1         
[33] R6_2.5.1                stringi_1.8.4           pkgload_1.4.0           brio_1.1.5             
[37] Rcpp_1.0.13             knitr_1.48              splines_4.4.1           httpuv_1.6.15          
[41] Matrix_1.7-0            tidyselect_1.2.1        rstudioapi_0.16.0       abind_1.4-5            
[45] yaml_2.3.10             codetools_0.2-20        miniUI_0.1.1.1          pkgbuild_1.4.4         
[49] lattice_0.22-6          tibble_3.2.1            KEGGREST_1.44.1         shiny_1.9.1            
[53] withr_3.0.1             evaluate_0.24.0         survival_3.7-0          urlchecker_1.0.1       
[57] Biostrings_2.72.1       pillar_1.9.0            generics_0.1.3          vroom_1.6.5            
[61] rprojroot_2.0.4         hms_1.1.3               munsell_0.5.1           scales_1.3.0           
[65] xtable_1.8-4            glue_1.7.0              tools_4.4.1             annotate_1.82.0        
[69] locfit_1.5-9.10         fs_1.6.4                XML_3.99-0.17           grid_4.4.1             
[73] AnnotationDbi_1.66.0    colorspace_2.1-1        GenomeInfoDbData_1.2.12 cli_3.6.3              
[77] fansi_1.0.6             S4Arrays_1.4.1          gtable_0.3.5            digest_0.6.36          
[81] SparseArray_1.4.8       htmlwidgets_1.6.4       farver_2.1.2            memoise_2.0.1          
[85] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7              mime_0.12              
[89] bit64_4.0.5