Here we will examine distances a little more, including looking into hierarchical cluster, which is a useful technique for unsupervised analysis of high-throughput data. We load the GEUVADIS RNA-seq dataset we prepared in distances.

library(DESeq2)
load("geuvadis.rda")
dds.f <- dds[,!dds$male]

Picking a transformation

Last time I showed how we could scale the data to account for sequencing depth differences, and then take the log of the scaled counts. Here I want to emphasize that the logarithm, while convenient, is not necessarily an optimal transformation for computing distances between samples. We use the normTransform function to produce log scaled counts, and then we use the meanSdPlot function to plot the standard deviation of the log scaled counts for each gene over the mean:

ntd <- normTransform(dds.f)
library(vsn)
meanSdPlot(assay(ntd), ranks=FALSE)

This plot makes clear that, even after the log transform, the data exhibit systematic: heteroskedasticity. We should not expect that all points fall on a horizontal line, but the red line shows that there is a systematic trend of variances on the mean. In particular, low counts, which are associated with less precision for calculating multiplicative changes between samples, have higher standard deviation, and so contribute more to a distance computation than high counts.

You can see this increased spread on the left side of an MA plot:

x <- assay(ntd)[,1]
y <- assay(ntd)[,2]
plot(.5*(x + y), y - x,
     cex=.5, col=rgb(0,0,0,.1), pch=20)
abline(h=0, col="red", lwd=3)

A better transformation than log is to compute the variance stabilizing transformation. An implementation of this for count data is the vst function in the DESeq2 package. This function also incorporates scaling for sequencing depth. We can examine how these two diagnostic plots change. The meanSdPlot has less of a systematic “hump” on the left side. Note: we do not seek a transformation which puts all the genes on a horizontal line – we expect high standard deviation for a subset of genes which show true biological differences across the samples. We only seek to remove systematic trends in the variance over the mean.

vsd <- vst(dds.f)
meanSdPlot(assay(vsd), ranks=FALSE)

We can also see from an MA plot of two samples, that the increased spread of data on the left side has been reduced – now genes across the entire range have roughly equal “chance” at contributed to differences between samples.

x <- assay(vsd)[,1]
y <- assay(vsd)[,2]
plot(.5*(x + y), y - x,
     cex=.5, col=rgb(0,0,0,.1), pch=20)
abline(h=0, col="red", lwd=3)

To drive the point home, here I calculate the differences between two samples, and square them. These squared differences are the elements of a Euclidean distance computation. I also calculate the average signal strength on the log scale (the x-axis in the first MA plot above).

sq.diffs <- (assay(ntd)[,2] - assay(ntd)[,1])^2
ave <- .5 * (assay(ntd)[,2] + assay(ntd)[,1])
plot(ave, sq.diffs)

Note that, if we just take the logarithm of scaled counts, genes with average signal less than 5 (32 on the count scale) contribute three times more to the sum than genes with average signal higher than 5. The inequality even holds for genes with average signal less than 3 (8 on the count scale).

sum(sq.diffs[ave < 5])/sum(sq.diffs[ave > 5])
## [1] 2.983881
sum(sq.diffs[ave < 3])/sum(sq.diffs[ave > 3])
## [1] 1.374198

Drawing cluster dendrograms

For the rest of the document, I focus on drawing and working with hierachical clustering results. Let’s start with the basic code. We first subset to the top genes by variance, and then compute sample-sample distances of the variance stabilized data using the dist function:

library(matrixStats)
rv <- rowVars(assay(vsd))
o <- order(rv,decreasing=TRUE)
dists <- dist(t(assay(vsd)[head(o,500),]))

The hierarchical clustering function in R is hclust, with a number of options you can read about in ?hclust. The default method is complete. From the man page:

Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.

hc <- hclust(dists)
plot(hc, labels=vsd$sample)

This plot is pretty busy, and it’s hard to see what’s going on. One thing I notice immediately, is that there are two of each sample, clustered closely together: these are technical replicates, which makes sense that they would be so close in the dendrogram. Better to see what’s going on is to color the labels.

To get a better sense, let’s zoom into a single population, and see how the samples cluster within this group.

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following objects are masked from 'package:testthat':
## 
##     equals, is_less_than, not
table(vsd$population)
## 
## CEU FIN GBR TSI YRI 
##  71  71  54  48  87
idx <- vsd$population == "TSI"
vsd2 <- vsd[,idx]
vsd2$sample %<>% factor

As before, we compute distances and a hierarchical clustering, then we ask for the results to be converted to a dendrogram object:

dists <- dist(t(assay(vsd2)[head(o,500),]))
hc <- hclust(dists)
dend <- as.dendrogram(hc)

Now we will use the dendextend package to color and annotate the dendrogram. Unlike above, after we’ve computed the dendrogram, we need to supply labels and colors in the dendrogram order not in the original data order. We therefore ask for the dendrogram order and apply this order to labels and colors when we want to modify these. Note that the four pairs of technical replicates often cluster tightly together.

suppressPackageStartupMessages(library(dendextend))
library(RColorBrewer)
palette(brewer.pal(8, "Dark2"))
o.dend <- order.dendrogram(dend)
labels(dend) <- vsd2$sample[o.dend]
labels_colors(dend) <- as.integer(vsd2$sample[o.dend])
plot(dend)

Note that, when we zoom into a single population, we can start to see the clustering of samples according to the sequencing center in which the samples were prepared. This indicates technical variation in the measurements.

plotPCA(vsd2, "center")

Finally, we show how to use hierarchical clustering to produce a distinct set of clusters, using the cutree function:

vsd2$cluster <- cutree(hc, k=5)
head(vsd2$cluster)
## ERR188398 ERR188110 ERR188253 ERR188384 ERR188254 ERR188357 
##         1         2         2         3         4         2
table(vsd2$cluster)
## 
##  1  2  3  4  5 
##  5 32  5  4  2

We can show these clusters on the dendrogram by re-coloring the labels:

labels_colors(dend) <- as.integer(vsd2$cluster[o.dend])
plot(dend)

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2          dendextend_1.14.0           magrittr_2.0.1             
##  [4] vsn_3.58.0                  DESeq2_1.30.0               SummarizedExperiment_1.20.0
##  [7] Biobase_2.50.0              MatrixGenerics_1.2.0        matrixStats_0.57.0         
## [10] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2         IRanges_2.24.1             
## [13] S4Vectors_0.28.1            BiocGenerics_0.36.0         testthat_3.0.1             
## [16] rmarkdown_2.6               devtools_2.3.2              usethis_2.0.0              
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6           fs_1.5.0               bit64_4.0.5            httr_1.4.2            
##  [5] rprojroot_2.0.2        tools_4.0.3            R6_2.5.0               affyio_1.60.0         
##  [9] DBI_1.1.0              colorspace_2.0-0       withr_2.3.0            gridExtra_2.3         
## [13] tidyselect_1.1.0       prettyunits_1.1.1      processx_3.4.5         preprocessCore_1.52.0 
## [17] bit_4.0.4              compiler_4.0.3         cli_2.2.0              desc_1.2.0            
## [21] DelayedArray_0.16.0    labeling_0.4.2         scales_1.1.1           hexbin_1.28.1         
## [25] genefilter_1.72.0      affy_1.68.0            callr_3.5.1            stringr_1.4.0         
## [29] digest_0.6.27          XVector_0.30.0         pkgconfig_2.0.3        htmltools_0.5.0       
## [33] sessioninfo_1.1.1      limma_3.46.0           rlang_0.4.9            RSQLite_2.2.1         
## [37] farver_2.0.3           generics_0.1.0         BiocParallel_1.24.1    dplyr_1.0.2           
## [41] RCurl_1.98-1.2         GenomeInfoDbData_1.2.4 Matrix_1.3-0           Rcpp_1.0.5            
## [45] munsell_0.5.0          fansi_0.4.1            viridis_0.5.1          lifecycle_0.2.0       
## [49] stringi_1.5.3          yaml_2.2.1             zlibbioc_1.36.0        pkgbuild_1.2.0        
## [53] grid_4.0.3             blob_1.2.1             crayon_1.3.4           lattice_0.20-41       
## [57] splines_4.0.3          annotate_1.68.0        locfit_1.5-9.4         knitr_1.30            
## [61] ps_1.5.0               pillar_1.4.7           geneplotter_1.68.0     codetools_0.2-18      
## [65] pkgload_1.1.0          XML_3.99-0.5           glue_1.4.2             evaluate_0.14         
## [69] BiocManager_1.30.10    remotes_2.2.0          vctrs_0.3.6            gtable_0.3.0          
## [73] purrr_0.3.4            assertthat_0.2.1       ggplot2_3.3.3          xfun_0.19             
## [77] xtable_1.8-4           viridisLite_0.3.0      survival_3.2-7         tibble_3.0.4          
## [81] AnnotationDbi_1.52.0   memoise_1.1.0          ellipsis_0.3.1