if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
::install("SummarizedExperiment") BiocManager
Exploring RNA measurements from brain
Here we will show some very simple explorations of high-dimensional data typical of computational biology. This particular study has a study ID SRP012682
and it happens to be the Genotype-Tissue Expression Project or GTEx.
You can look up the study on the SRA (Sequence Read Archive) or the ENA (European Nucleotide Archive) and find a description:
The aim of the Genotype-Tissue Expression (GTEx) Project is to increase our understanding of how changes in our genes affect human health and disease with the ultimate goal of improving health care for future generations. GTEx will create a database that researchers can use to study how inherited changes in genes lead to common diseases. GTEx researchers are studying genes in different tissues obtained from many different people….
There is a computational project, called recount2, which performs a basic summarization of public data sets with gene expression data. There are numerous competing methods for computing gene expression from raw data, which we will cover later in the course. For now, let’s just load the summarized table computed by recount2. This gives us, for each gene and each sample, a measurement of gene expression. We can use these measurements to compare samples, and we will discuss distances and normalization as a future topic.
First we need to install the SummarizedExperiment R package, which is hosted on Bioconductor:
We can then download the summarized experiment data (note, it is 200 Mb so it may take a minute):
library(here)
here() starts at /Users/milove/teach/compbio/compbio_src
<- "http://duffel.rail.bio/recount/SRP012682/rse_gene_brain.Rdata"
url <- here("eda","brain.rda")
file if (!file.exists(file)) download.file(url, file)
load(file)
I’ve written a simple function which scales the gene expression data. It should roughly be counts of molecules detected for each gene and for each sample. It takes maybe half a minute to do the scaling.
library(SummarizedExperiment)
library(here)
source(here("bioc","my_scale_counts.R"))
<- my_scale_counts(rse_gene) se
We will learn about this se
object later, but for now we will just extract the information we need to do some EDA (exploratory data analysis). There are smarter ways to work with these objects than extracting the matrices, which we will see next week when we explore Bioconductor objects.
We have ~60,000 genes and ~1400 samples.
<- assay(se)
cts dim(cts)
[1] 58037 1409
The “condition” for the samples is part of the “metadata” of the se
object. Like I said, we will explore this in a later section, but for now we pull out the information we need:
<- se$smtsd
condition head(condition)
[1] "Brain - Cerebellar Hemisphere" "Brain - Cerebellar Hemisphere"
[3] "Brain - Amygdala" "Brain - Cortex"
[5] "Brain - Nucleus accumbens (basal ganglia)" "Brain - Cerebellar Hemisphere"
We’ll do a little string processing, to chop off the redundant information:
library(magrittr)
# substitute "Brain - ..." with "..."
%<>% (function(x) sub("Brain - (.*)", "\\1", x))
condition %<>% factor
condition table(condition)
condition
Amygdala Anterior cingulate cortex (BA24)
81 99
Caudate (basal ganglia) Cerebellar Hemisphere
134 118
Cerebellum Cortex
145 132
Frontal Cortex (BA9) Hippocampus
120 103
Hypothalamus Nucleus accumbens (basal ganglia)
104 123
Putamen (basal ganglia) Spinal cord (cervical c-1)
103 76
Substantia nigra
71
There are many rows which have so few measurements (counts), that we can safely ignore them for large-scale EDA (e.g. calculating distances between samples). These kind of filtering decisions are highly dataset-specific, you wouldn’t want to use this exact filter for another dataset, but here, it makes sense to ask that the row sum be larger than 100, because we have ~1400 samples.
dim(cts)
[1] 58037 1409
<- rowSums(cts)
rs hist(log10(rs + 1))
abline(v=2, col="blue", lwd=3)
<- cts[rs > 100,]
cts dim(cts)
[1] 48948 1409
A very simple plot is just to look at two samples’ gene expression values against each other in a scatterplot. Typically, we take the logarithm, as gene expression is highly skewed (many genes with low expression, a few with very high expression).
plot
can be used to make scatterplots of x
and y
or of a matrix with two columns, but with 10s of thousands of observations it will be a bit slow. smoothScatter
does a better job with so many data points.
smoothScatter(cts[,1:2])
Warning in KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, : Binning grid too coarse
for current (small) bandwidth: consider increasing 'gridsize'
<- log10(cts + 1)
logcts smoothScatter(logcts[,1:2])
Here you can get a sense for the distribution of counts (log-scale).
hist(logcts[,1])
A first pass for looking at the sample distances is to make a PCA plot. To speed up the calculation of the PCs, we first subset to the top 500 genes by unsupervised variance of log counts (we don’t look at the condition to which the sample belong, just look at the total variance).
library(matrixStats)
<- rowVars(logcts)
rv <- order(rv, decreasing=TRUE)[1:500]
o <- prcomp(t(logcts[o,]))
pc plot(pc$x[,1:2])
We can get a better sense by coloring the groups by condition.
library(ggplot2)
<- data.frame(pc$x[,1:2], condition)
dat ggplot(dat, aes(PC1, PC2, col=condition)) +
geom_point()
Let’s focus on the samples belonging to four specific regions of the brain:
<- c("Cerebellum","Cortex","Hippocampus","Hypothalamus")
regions <- condition %in% regions
condition_sub_idx <- droplevels(condition[condition_sub_idx])
condition_sub table(condition_sub_idx)
condition_sub_idx
FALSE TRUE
925 484
We can compute Euclidean distances between the log counts for these samples (afterwards we will discuss considerations for calculating distances for count data).
# distance compares rows,
# so we need to transpose the matrix
<- dist(t( logcts[o,condition_sub_idx] )) dists
Now we will put together pieces to visualize the distances as a heatmap. First, we convert the distances from a dist into a matrix:
class(dists)
[1] "dist"
<- as.matrix(dists) dist_mat
We can plot these distances using the pheatmap package. The default colors are red, pale yellow, and blue. Let’s create a new color palette:
library(RColorBrewer)
<- colorRampPalette(brewer.pal(9,"Blues")) # returns a function
colRamp plot(1:100, col=colRamp(100), pch=20, main="Blues")
Let’s reverse, so that bigger distances will give lighter color, this will help to visualize similarity in the heatmap:
<- rev(colRamp(100)) cols
We make a data.frame that matches the names of samples to condition:
<- data.frame(
anno_df condition=condition_sub,
row.names=colnames(dist_mat)
)
Putting it all together:
library(pheatmap)
pheatmap(dist_mat,
color=cols,
clustering_distance_rows=dists,
clustering_distance_cols=dists,
annotation_col=anno_df,
show_rownames=FALSE, show_colnames=FALSE)
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] pheatmap_1.0.12 RColorBrewer_1.1-3 ggplot2_3.5.1
[4] magrittr_2.0.3 SummarizedExperiment_1.34.0 Biobase_2.64.0
[7] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1 IRanges_2.38.1
[10] S4Vectors_0.42.1 BiocGenerics_0.50.0 MatrixGenerics_1.16.0
[13] matrixStats_1.3.0 here_1.0.1 testthat_3.2.1.1
[16] rmarkdown_2.27 devtools_2.4.5 usethis_3.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.5 xfun_0.46 htmlwidgets_1.6.4 remotes_2.5.0
[5] lattice_0.22-6 generics_0.1.3 vctrs_0.6.5 tools_4.4.1
[9] fansi_1.0.6 tibble_3.2.1 pkgconfig_2.0.3 Matrix_1.7-0
[13] KernSmooth_2.23-24 lifecycle_1.0.4 GenomeInfoDbData_1.2.12 farver_2.1.2
[17] compiler_4.4.1 stringr_1.5.1 brio_1.1.5 munsell_0.5.1
[21] httpuv_1.6.15 htmltools_0.5.8.1 yaml_2.3.10 pillar_1.9.0
[25] later_1.3.2 crayon_1.5.3 urlchecker_1.0.1 ellipsis_0.3.2
[29] cachem_1.1.0 DelayedArray_0.30.1 sessioninfo_1.2.2 abind_1.4-5
[33] mime_0.12 tidyselect_1.2.1 digest_0.6.36 stringi_1.8.4
[37] dplyr_1.1.4 purrr_1.0.2 labeling_0.4.3 rprojroot_2.0.4
[41] fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1 cli_3.6.3
[45] SparseArray_1.4.8 S4Arrays_1.4.1 utf8_1.2.4 pkgbuild_1.4.4
[49] withr_3.0.1 scales_1.3.0 UCSC.utils_1.0.0 promises_1.3.0
[53] XVector_0.44.0 httr_1.4.7 memoise_2.0.1 shiny_1.9.1
[57] evaluate_0.24.0 knitr_1.48 miniUI_0.1.1.1 profvis_0.3.8
[61] rlang_1.1.4 Rcpp_1.0.13 xtable_1.8-4 glue_1.7.0
[65] pkgload_1.4.0 rstudioapi_0.16.0 jsonlite_1.8.8 R6_2.5.1
[69] fs_1.6.4 zlibbioc_1.50.0