Here we will show some very simple explorations of high-dimensional data typical of computational biology. This particular study has a study ID SRP012682 and it happens to be the Genotype-Tissue Expression Project or GTEx.

You can look up the study on the SRA (Sequence Read Archive) or the ENA (European Nucleotide Archive) and find a description:

The aim of the Genotype-Tissue Expression (GTEx) Project is to increase our understanding of how changes in our genes affect human health and disease with the ultimate goal of improving health care for future generations. GTEx will create a database that researchers can use to study how inherited changes in genes lead to common diseases. GTEx researchers are studying genes in different tissues obtained from many different people….

There is a computational project, called recount2, which performs a basic summarization of public data sets with gene expression data. There are numerous competing methods for computing gene expression from raw data, which we will cover later in the course. For now, let’s just load the summarized table computed by recount2. This gives us, for each gene and each sample, a measurement of gene expression. We can use these measurements to compare samples, and we will discuss distances and normalization as a future topic.

First we need to install the SummarizedExperiment R package, which is hosted on Bioconductor:

if (!requireNamespace("BiocManager"))
  install.packages("BiocManager")
BiocManager::install("SummarizedExperiment")

We can then download the summarized experiment data (note, it is 200 Mb so it may take a minute):

url <- "http://duffel.rail.bio/recount/SRP012682/rse_gene_brain.Rdata"
file <- "brain.rda"
if (!file.exists(file)) download.file(url, file)
load(file)

I’ve written a simple function which scales the gene expression data. It should roughly be counts of molecules detected for each gene and for each sample. It takes maybe half a minute to do the scaling.

library(SummarizedExperiment)
library(here)
source(here("bioc","my_scale_counts.R"))
se <- my_scale_counts(rse_gene)

We will learn about this se object later, but for now we will just extract the information we need to do some EDA (exploratory data analysis). There are smarter ways to work with these objects than extracting the matrices, which we will see next week when we explore Bioconductor objects.

We have ~60,000 genes and ~1400 samples.

cts <- assay(se)
dim(cts)
## [1] 58037  1409

The “condition” for the samples is part of the “metadata” of the se object. Like I said, we will explore this in a later section, but for now we pull out the information we need:

condition <- se$smtsd
head(condition)
## [1] "Brain - Cerebellar Hemisphere"            
## [2] "Brain - Cerebellar Hemisphere"            
## [3] "Brain - Amygdala"                         
## [4] "Brain - Cortex"                           
## [5] "Brain - Nucleus accumbens (basal ganglia)"
## [6] "Brain - Cerebellar Hemisphere"

We’ll do a little string processing, to chop off the redundant information:

library(magrittr)
# substitute "Brain - ..." with "..."
condition %<>% (function(x) sub("Brain - (.*)", "\\1", x))
condition %<>% factor
table(condition)
## condition
##                          Amygdala  Anterior cingulate cortex (BA24) 
##                                81                                99 
##           Caudate (basal ganglia)             Cerebellar Hemisphere 
##                               134                               118 
##                        Cerebellum                            Cortex 
##                               145                               132 
##              Frontal Cortex (BA9)                       Hippocampus 
##                               120                               103 
##                      Hypothalamus Nucleus accumbens (basal ganglia) 
##                               104                               123 
##           Putamen (basal ganglia)        Spinal cord (cervical c-1) 
##                               103                                76 
##                  Substantia nigra 
##                                71

There are many rows which have so few measurements (counts), that we can safely ignore them for large-scale EDA (e.g. calculating distances between samples). These kind of filtering decisions are highly dataset-specific, you wouldn’t want to use this exact filter for another dataset, but here, it makes sense to ask that the row sum be larger than 100, because we have ~1400 samples.

dim(cts)
## [1] 58037  1409
rs <- rowSums(cts)
hist(log10(rs + 1))
abline(v=2, col="blue", lwd=3)

cts <- cts[rs > 100,]
dim(cts)
## [1] 48948  1409

A very simple plot is just to look at two samples’ gene expression values against each other in a scatterplot. Typically, we take the logarithm, as gene expression is highly skewed (many genes with low expression, a few with very high expression).

plot can be used to make scatterplots of x and y or of a matrix with two columns, but with 10s of thousands of observations it will be a bit slow. smoothScatter does a better job with so many data points.

smoothScatter(cts[,1:2])
## Warning in KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize =
## nbin, : Binning grid too coarse for current (small) bandwidth: consider
## increasing 'gridsize'

logcts <- log10(cts + 1)
smoothScatter(logcts[,1:2])

Here you can get a sense for the distribution of counts (log-scale).

hist(logcts[,1])

A first pass for looking at the sample distances is to make a PCA plot. To speed up the calculation of the PCs, we first subset to the top 500 genes by unsupervised variance of log counts (we don’t look at the condition to which the sample belong, just look at the total variance).

library(matrixStats)
rv <- rowVars(logcts)
o <- order(rv, decreasing=TRUE)[1:500]
pc <- prcomp(t(logcts[o,]))
plot(pc$x[,1:2])

We can get a better sense by coloring the groups by condition.

library(ggplot2)
dat <- data.frame(pc$x[,1:2], condition)
ggplot(dat, aes(PC1, PC2, col=condition)) + 
  geom_point()

Let’s focus on the samples belonging to four specific regions of the brain:

regions <- c("Cerebellum","Cortex","Hippocampus","Hypothalamus")
condition_sub_idx <- condition %in% regions
condition_sub <- droplevels(condition[condition_sub_idx])
table(condition_sub_idx)
## condition_sub_idx
## FALSE  TRUE 
##   925   484

We can compute Euclidean distances between the log counts for these samples (afterwards we will discuss considerations for calculating distances for count data).

# distance compares rows, 
# so we need to transpose the matrix
dists <- dist(t( logcts[o,condition_sub_idx] ))

Now we will put together pieces to visualize the distances as a heatmap. First, we convert the distances from a dist into a matrix:

class(dists)
## [1] "dist"
dist_mat <- as.matrix(dists)

We can plot these distances using the pheatmap package. The default colors are red, pale yellow, and blue. Let’s create a new color palette:

library(RColorBrewer)
colRamp <- colorRampPalette(brewer.pal(9,"Blues")) # returns a function
plot(1:100, col=colRamp(100), pch=20, main="Blues")

Let’s reverse, so that bigger distances will give lighter color, this will help to visualize similarity in the heatmap:

cols <- rev(colRamp(100))

We make a data.frame that matches the names of samples to condition:

anno_df <- data.frame(
  condition=condition_sub,
  row.names=colnames(dist_mat)
)

Putting it all together:

library(pheatmap)
pheatmap(dist_mat, 
         color=cols,
         clustering_distance_rows=dists,
         clustering_distance_cols=dists,
         annotation_col=anno_df,
         show_rownames=FALSE, show_colnames=FALSE)

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 11.5.1
## 
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices datasets  utils     methods  
## [8] base     
## 
## other attached packages:
##  [1] pheatmap_1.0.12             RColorBrewer_1.1-3         
##  [3] ggplot2_3.3.6               magrittr_2.0.3             
##  [5] here_1.0.1                  SummarizedExperiment_1.27.1
##  [7] Biobase_2.57.1              GenomicRanges_1.49.0       
##  [9] GenomeInfoDb_1.33.3         IRanges_2.31.0             
## [11] S4Vectors_0.35.1            BiocGenerics_0.43.0        
## [13] MatrixGenerics_1.9.1        matrixStats_0.62.0         
## [15] pkgdown_2.0.5               testthat_3.1.4             
## [17] rmarkdown_2.14              devtools_2.4.3             
## [19] usethis_2.1.6              
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.1             pkgload_1.3.0          jsonlite_1.8.0        
##  [4] bslib_0.3.1            brio_1.1.3             assertthat_0.2.1      
##  [7] highr_0.9              GenomeInfoDbData_1.2.8 yaml_2.3.5            
## [10] remotes_2.4.2          sessioninfo_1.2.2      pillar_1.7.0          
## [13] lattice_0.20-45        glue_1.6.2             digest_0.6.29         
## [16] XVector_0.37.0         colorspace_2.0-3       htmltools_0.5.2       
## [19] Matrix_1.4-1           pkgconfig_2.0.3        zlibbioc_1.43.0       
## [22] purrr_0.3.4            scales_1.2.0           processx_3.6.1        
## [25] tibble_3.1.7           farver_2.1.1           generics_0.1.3        
## [28] ellipsis_0.3.2         withr_2.5.0            cachem_1.0.6          
## [31] cli_3.3.0              crayon_1.5.1           memoise_2.0.1         
## [34] evaluate_0.15          ps_1.7.1               fs_1.5.2              
## [37] fansi_1.0.3            pkgbuild_1.3.1         tools_4.2.1           
## [40] prettyunits_1.1.1      lifecycle_1.0.1        stringr_1.4.0         
## [43] munsell_0.5.0          DelayedArray_0.23.0    callr_3.7.0           
## [46] compiler_4.2.1         jquerylib_0.1.4        rlang_1.0.3           
## [49] grid_4.2.1             RCurl_1.98-1.7         rstudioapi_0.13       
## [52] labeling_0.4.2         bitops_1.0-7           gtable_0.3.0          
## [55] DBI_1.1.3              R6_2.5.1               knitr_1.39            
## [58] dplyr_1.0.9            fastmap_1.1.0          utf8_1.2.2            
## [61] rprojroot_2.0.3        KernSmooth_2.23-20     stringi_1.7.6         
## [64] vctrs_0.4.1            tidyselect_1.1.2       xfun_0.31