Here we will show some very simple explorations of high-dimensional
data typical of computational biology. This particular study has a study
ID SRP012682
and it happens to be the Genotype-Tissue
Expression Project or GTEx.
You can look up the study on the SRA (Sequence Read Archive) or the ENA (European Nucleotide Archive) and find a description:
The aim of the Genotype-Tissue Expression (GTEx) Project is to increase our understanding of how changes in our genes affect human health and disease with the ultimate goal of improving health care for future generations. GTEx will create a database that researchers can use to study how inherited changes in genes lead to common diseases. GTEx researchers are studying genes in different tissues obtained from many different people….
There is a computational project, called recount2, which performs a basic summarization of public data sets with gene expression data. There are numerous competing methods for computing gene expression from raw data, which we will cover later in the course. For now, let’s just load the summarized table computed by recount2. This gives us, for each gene and each sample, a measurement of gene expression. We can use these measurements to compare samples, and we will discuss distances and normalization as a future topic.
First we need to install the SummarizedExperiment R package, which is hosted on Bioconductor:
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("SummarizedExperiment")
We can then download the summarized experiment data (note, it is 200 Mb so it may take a minute):
url <- "http://duffel.rail.bio/recount/SRP012682/rse_gene_brain.Rdata"
file <- "brain.rda"
if (!file.exists(file)) download.file(url, file)
load(file)
I’ve written a simple function which scales the gene expression data. It should roughly be counts of molecules detected for each gene and for each sample. It takes maybe half a minute to do the scaling.
library(SummarizedExperiment)
library(here)
source(here("bioc","my_scale_counts.R"))
se <- my_scale_counts(rse_gene)
We will learn about this se
object later, but for now we
will just extract the information we need to do some EDA (exploratory
data analysis). There are smarter ways to work with these objects than
extracting the matrices, which we will see next week when we explore
Bioconductor objects.
We have ~60,000 genes and ~1400 samples.
cts <- assay(se)
dim(cts)
## [1] 58037 1409
The “condition” for the samples is part of the “metadata” of the
se
object. Like I said, we will explore this in a later
section, but for now we pull out the information we need:
condition <- se$smtsd
head(condition)
## [1] "Brain - Cerebellar Hemisphere"
## [2] "Brain - Cerebellar Hemisphere"
## [3] "Brain - Amygdala"
## [4] "Brain - Cortex"
## [5] "Brain - Nucleus accumbens (basal ganglia)"
## [6] "Brain - Cerebellar Hemisphere"
We’ll do a little string processing, to chop off the redundant information:
library(magrittr)
# substitute "Brain - ..." with "..."
condition %<>% (function(x) sub("Brain - (.*)", "\\1", x))
condition %<>% factor
table(condition)
## condition
## Amygdala Anterior cingulate cortex (BA24)
## 81 99
## Caudate (basal ganglia) Cerebellar Hemisphere
## 134 118
## Cerebellum Cortex
## 145 132
## Frontal Cortex (BA9) Hippocampus
## 120 103
## Hypothalamus Nucleus accumbens (basal ganglia)
## 104 123
## Putamen (basal ganglia) Spinal cord (cervical c-1)
## 103 76
## Substantia nigra
## 71
There are many rows which have so few measurements (counts), that we can safely ignore them for large-scale EDA (e.g. calculating distances between samples). These kind of filtering decisions are highly dataset-specific, you wouldn’t want to use this exact filter for another dataset, but here, it makes sense to ask that the row sum be larger than 100, because we have ~1400 samples.
dim(cts)
## [1] 58037 1409
rs <- rowSums(cts)
hist(log10(rs + 1))
abline(v=2, col="blue", lwd=3)
cts <- cts[rs > 100,]
dim(cts)
## [1] 48948 1409
A very simple plot is just to look at two samples’ gene expression values against each other in a scatterplot. Typically, we take the logarithm, as gene expression is highly skewed (many genes with low expression, a few with very high expression).
plot
can be used to make scatterplots of x
and y
or of a matrix with two columns, but with 10s of
thousands of observations it will be a bit slow.
smoothScatter
does a better job with so many data
points.
smoothScatter(cts[,1:2])
## Warning in KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize =
## nbin, : Binning grid too coarse for current (small) bandwidth: consider
## increasing 'gridsize'
logcts <- log10(cts + 1)
smoothScatter(logcts[,1:2])
Here you can get a sense for the distribution of counts (log-scale).
hist(logcts[,1])
A first pass for looking at the sample distances is to make a PCA plot. To speed up the calculation of the PCs, we first subset to the top 500 genes by unsupervised variance of log counts (we don’t look at the condition to which the sample belong, just look at the total variance).
library(matrixStats)
rv <- rowVars(logcts)
o <- order(rv, decreasing=TRUE)[1:500]
pc <- prcomp(t(logcts[o,]))
plot(pc$x[,1:2])
We can get a better sense by coloring the groups by condition.
library(ggplot2)
dat <- data.frame(pc$x[,1:2], condition)
ggplot(dat, aes(PC1, PC2, col=condition)) +
geom_point()
Let’s focus on the samples belonging to four specific regions of the brain:
regions <- c("Cerebellum","Cortex","Hippocampus","Hypothalamus")
condition_sub_idx <- condition %in% regions
condition_sub <- droplevels(condition[condition_sub_idx])
table(condition_sub_idx)
## condition_sub_idx
## FALSE TRUE
## 925 484
We can compute Euclidean distances between the log counts for these samples (afterwards we will discuss considerations for calculating distances for count data).
# distance compares rows,
# so we need to transpose the matrix
dists <- dist(t( logcts[o,condition_sub_idx] ))
Now we will put together pieces to visualize the distances as a heatmap. First, we convert the distances from a dist into a matrix:
class(dists)
## [1] "dist"
dist_mat <- as.matrix(dists)
We can plot these distances using the pheatmap package. The default colors are red, pale yellow, and blue. Let’s create a new color palette:
library(RColorBrewer)
colRamp <- colorRampPalette(brewer.pal(9,"Blues")) # returns a function
plot(1:100, col=colRamp(100), pch=20, main="Blues")
Let’s reverse, so that bigger distances will give lighter color, this will help to visualize similarity in the heatmap:
cols <- rev(colRamp(100))
We make a data.frame that matches the names of samples to condition:
anno_df <- data.frame(
condition=condition_sub,
row.names=colnames(dist_mat)
)
Putting it all together:
library(pheatmap)
pheatmap(dist_mat,
color=cols,
clustering_distance_rows=dists,
clustering_distance_cols=dists,
annotation_col=anno_df,
show_rownames=FALSE, show_colnames=FALSE)
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 11.5.1
##
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] pheatmap_1.0.12 RColorBrewer_1.1-3
## [3] ggplot2_3.3.6 magrittr_2.0.3
## [5] here_1.0.1 SummarizedExperiment_1.27.1
## [7] Biobase_2.57.1 GenomicRanges_1.49.0
## [9] GenomeInfoDb_1.33.3 IRanges_2.31.0
## [11] S4Vectors_0.35.1 BiocGenerics_0.43.0
## [13] MatrixGenerics_1.9.1 matrixStats_0.62.0
## [15] pkgdown_2.0.5 testthat_3.1.4
## [17] rmarkdown_2.14 devtools_2.4.3
## [19] usethis_2.1.6
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.1 pkgload_1.3.0 jsonlite_1.8.0
## [4] bslib_0.3.1 brio_1.1.3 assertthat_0.2.1
## [7] highr_0.9 GenomeInfoDbData_1.2.8 yaml_2.3.5
## [10] remotes_2.4.2 sessioninfo_1.2.2 pillar_1.7.0
## [13] lattice_0.20-45 glue_1.6.2 digest_0.6.29
## [16] XVector_0.37.0 colorspace_2.0-3 htmltools_0.5.2
## [19] Matrix_1.4-1 pkgconfig_2.0.3 zlibbioc_1.43.0
## [22] purrr_0.3.4 scales_1.2.0 processx_3.6.1
## [25] tibble_3.1.7 farver_2.1.1 generics_0.1.3
## [28] ellipsis_0.3.2 withr_2.5.0 cachem_1.0.6
## [31] cli_3.3.0 crayon_1.5.1 memoise_2.0.1
## [34] evaluate_0.15 ps_1.7.1 fs_1.5.2
## [37] fansi_1.0.3 pkgbuild_1.3.1 tools_4.2.1
## [40] prettyunits_1.1.1 lifecycle_1.0.1 stringr_1.4.0
## [43] munsell_0.5.0 DelayedArray_0.23.0 callr_3.7.0
## [46] compiler_4.2.1 jquerylib_0.1.4 rlang_1.0.3
## [49] grid_4.2.1 RCurl_1.98-1.7 rstudioapi_0.13
## [52] labeling_0.4.2 bitops_1.0-7 gtable_0.3.0
## [55] DBI_1.1.3 R6_2.5.1 knitr_1.39
## [58] dplyr_1.0.9 fastmap_1.1.0 utf8_1.2.2
## [61] rprojroot_2.0.3 KernSmooth_2.23-20 stringi_1.7.6
## [64] vctrs_0.4.1 tidyselect_1.1.2 xfun_0.31