library(DESeq2)
load("geuvadis.rda")
Distance part 2: transformations and clustering
Here we will examine distances a little more, including looking into data transformations to stabilize variance, and hierarchical clustering, which is a useful technique for unsupervised analysis of high-throughput data. We load the GEUVADIS RNA-seq dataset we prepared in distances.
Picking a transformation
Last time I showed how we could scale the data to account for sequencing depth differences, and then take the log of the scaled counts. Here I want to emphasize that the logarithm, while convenient, is not necessarily an optimal transformation for computing distances between samples. We use the normTransform
function to produce log scaled counts, and then we use the meanSdPlot
function to plot the standard deviation of the log scaled counts for each gene over the mean:
<- normTransform(dds)
ntd library(vsn)
meanSdPlot(assay(ntd), ranks=FALSE)
This plot makes clear that, even after the log transform, the data exhibit systematic: heteroskedasticity. We should not expect that all points fall on a horizontal line, but the red line shows that there is a systematic trend of variances on the mean. In particular, low counts, which are associated with less precision for calculating multiplicative changes between samples, have higher standard deviation, and so contribute more to a distance computation than high counts.
You can see this increased spread on the left side of an MA plot:
<- assay(ntd)[,1]
x <- assay(ntd)[,2]
y plot(.5*(x + y), y - x,
cex=.5, col=rgb(0,0,0,.1), pch=20)
abline(h=0, col="red", lwd=3)
A better transformation than log is to compute the variance stabilizing transformation. An implementation of this for count data is the vst
function in the DESeq2 package. This function also incorporates scaling for sequencing depth. We can examine how these two diagnostic plots change. The meanSdPlot
has less of a systematic “hump” on the left side. Note: we do not seek a transformation which puts all the genes on a horizontal line – we expect high standard deviation for a subset of genes which show true biological differences across the samples. We only seek to remove systematic trends in the variance over the mean.
<- vst(dds)
vsd meanSdPlot(assay(vsd), ranks=FALSE)
We can also see from an MA plot of two samples, that the increased spread of data on the left side has been reduced – now genes across the entire range have roughly equal “chance” at contributed to differences between samples.
<- assay(vsd)[,1]
x <- assay(vsd)[,2]
y plot(.5*(x + y), y - x,
cex=.5, col=rgb(0,0,0,.1), pch=20)
abline(h=0, col="red", lwd=3)
To drive the point home, here I calculate the differences between two samples, and square them. These squared differences are the elements of a Euclidean distance computation. I also calculate the average signal strength on the log scale (the x-axis in the first MA plot above).
<- (assay(ntd)[,2] - assay(ntd)[,1])^2
sq.diffs <- .5 * (assay(ntd)[,2] + assay(ntd)[,1])
ave plot(ave, sq.diffs)
Note that, if we just take the logarithm of scaled counts, genes with average signal less than 5 (32 on the count scale) contribute three times more to the sum than genes with average signal higher than 5. The inequality even holds for genes with average signal less than 3 (8 on the count scale).
sum(sq.diffs[ave < 5])/sum(sq.diffs[ave > 5])
[1] 2.983881
sum(sq.diffs[ave < 3])/sum(sq.diffs[ave > 3])
[1] 1.374198
Drawing cluster dendrograms
For the rest of the document, I focus on drawing and working with hierachical clustering results. Let’s start with the basic code. We first subset to the top genes by variance, and then compute sample-sample distances of the variance stabilized data using the dist
function:
library(matrixStats)
<- rowVars(assay(vsd))
rv <- order(rv,decreasing=TRUE)
o <- dist(t(assay(vsd)[head(o,500),])) dists
The hierarchical clustering function in R is hclust
, with a number of options you can read about in ?hclust
. The default method
is complete
. From the man page:
Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.
<- hclust(dists)
hc plot(hc, labels=vsd$sample)
This plot is pretty busy, and it’s hard to see what’s going on. One thing I notice immediately, is that there are two of each sample, clustered closely together: these are technical replicates, which makes sense that they would be so close in the dendrogram. Better to see what’s going on is to color the labels.
To get a better sense, let’s zoom into a single population, and see how the samples cluster within this group.
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:GenomicRanges':
subtract
The following objects are masked from 'package:testthat':
equals, is_less_than, not
table(vsd$population)
CEU FIN GBR TSI YRI
162 114 115 113 159
<- vsd$population == "TSI"
idx <- vsd[,idx]
vsd2 $sample %<>% factor vsd2
As before, we compute distances and a hierarchical clustering, then we ask for the results to be converted to a dendrogram object:
<- dist(t(assay(vsd2)[head(o,500),]))
dists <- hclust(dists)
hc <- as.dendrogram(hc) dend
Now we will use the dendextend package to color and annotate the dendrogram. Unlike above, after we’ve computed the dendrogram, we need to supply labels and colors in the dendrogram order not in the original data order. We therefore ask for the dendrogram order and apply this order to labels and colors when we want to modify these. Note that the four pairs of technical replicates often cluster tightly together.
suppressPackageStartupMessages(library(dendextend))
library(RColorBrewer)
palette(brewer.pal(8, "Dark2"))
<- order.dendrogram(dend)
o.dend labels(dend) <- vsd2$sample[o.dend]
labels_colors(dend) <- as.integer(vsd2$sample[o.dend])
plot(dend)
Note that, when we zoom into a single population, we can start to see the clustering of samples according to the sequencing center in which the samples were prepared. This indicates technical variation in the measurements.
plotPCA(vsd2, "center")
using ntop=500 top features by variance
Finally, we show how to use hierarchical clustering to produce a distinct set of clusters, using the cutree
function:
$cluster <- cutree(hc, k=5)
vsd2head(vsd2$cluster)
ERR188398 ERR188110 ERR188253 ERR188384 ERR188254 ERR188357
1 2 2 3 4 3
table(vsd2$cluster)
1 2 3 4 5
18 66 15 4 10
We can show these clusters on the dendrogram by re-coloring the labels:
labels_colors(dend) <- as.integer(vsd2$cluster[o.dend])
plot(dend)
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] RColorBrewer_1.1-3 dendextend_1.17.1 magrittr_2.0.3
[4] vsn_3.72.0 DESeq2_1.44.0 SummarizedExperiment_1.34.0
[7] Biobase_2.64.0 MatrixGenerics_1.16.0 matrixStats_1.3.0
[10] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1 IRanges_2.38.1
[13] S4Vectors_0.42.1 BiocGenerics_0.50.0 testthat_3.2.1.1
[16] rmarkdown_2.27 devtools_2.4.5 usethis_3.0.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2 dplyr_1.1.4
[5] viridis_0.6.5 fastmap_1.2.0 promises_1.3.0 digest_0.6.36
[9] mime_0.12 lifecycle_1.0.4 ellipsis_0.3.2 statmod_1.5.0
[13] compiler_4.4.1 rlang_1.1.4 tools_4.4.1 utf8_1.2.4
[17] yaml_2.3.10 knitr_1.48 labeling_0.4.3 S4Arrays_1.4.1
[21] htmlwidgets_1.6.4 pkgbuild_1.4.4 DelayedArray_0.30.1 pkgload_1.4.0
[25] abind_1.4-5 BiocParallel_1.38.0 miniUI_0.1.1.1 withr_3.0.1
[29] purrr_1.0.2 grid_4.4.1 preprocessCore_1.66.0 fansi_1.0.6
[33] urlchecker_1.0.1 profvis_0.3.8 xtable_1.8-4 colorspace_2.1-1
[37] ggplot2_3.5.1 scales_1.3.0 cli_3.6.3 crayon_1.5.3
[41] generics_0.1.3 remotes_2.5.0 rstudioapi_0.16.0 httr_1.4.7
[45] sessioninfo_1.2.2 cachem_1.1.0 affy_1.82.0 stringr_1.5.1
[49] zlibbioc_1.50.0 parallel_4.4.1 BiocManager_1.30.23 XVector_0.44.0
[53] vctrs_0.6.5 Matrix_1.7-0 jsonlite_1.8.8 locfit_1.5-9.10
[57] limma_3.60.4 hexbin_1.28.3 affyio_1.74.0 glue_1.7.0
[61] codetools_0.2-20 stringi_1.8.4 gtable_0.3.5 later_1.3.2
[65] UCSC.utils_1.0.0 munsell_0.5.1 tibble_3.2.1 pillar_1.9.0
[69] htmltools_0.5.8.1 brio_1.1.5 GenomeInfoDbData_1.2.12 R6_2.5.1
[73] evaluate_0.24.0 shiny_1.9.1 lattice_0.22-6 memoise_2.0.1
[77] httpuv_1.6.15 Rcpp_1.0.13 gridExtra_2.3 SparseArray_1.4.8
[81] xfun_0.46 fs_1.6.4 pkgconfig_2.0.3