<- "http://bowtie-bio.sourceforge.net/recount/countTables/bottomly_count_table.txt"
url <- "bottomly_count_table.txt"
file if (!file.exists(file)) download.file(url, file)
<- "http://bowtie-bio.sourceforge.net/recount/phenotypeTables/bottomly_phenodata.txt"
url <- "bottomly_phenodata.txt"
file if (!file.exists(file)) download.file(url, file)
Surrogate variable analysis: hidden batch effects
We showed in the section on batch effects, that we can sometimes identify the source of batch effects, and by using statistical models, we can remove any sample-specific variation we can predict based on features like sequence content or gene length. Here we will show a powerful procedure, which doesn’t require the use of knowing exactly how the counts will vary across batches – it uses only the biological condition, and looks for large scale variation which is orthogonal to the biological condition. This approach (we will examine a specific method called surrogate variable analysis, or SVA), does require that the technical variation be orthogonal to the biological conditions.
For the description of the SVA method, see these two papers:
- Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis link
- svaseq: removing batch effects and other unwanted noise from sequencing data link
Other methods which can identify hidden batches are described in these papers:
- Normalization of RNA-seq data using factor analysis of control genes or samples link
- Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses link
We start by downloading RNA-seq count tables from the original recount project:
This experiment consisted of 10 and 11 mice of two different strains. But useful for our demonstration of SVA, it is key that the experiment was performed in three batches. We will pretend in the following code that we do not know the batches, although we do. We will estimate technical variation using only the information about which mouse is of which strain.
library(magrittr)
<- read.table("bottomly_count_table.txt", header=TRUE, row.names=1)
cts <- read.table("bottomly_phenodata.txt", header=TRUE, row.names=1)
coldata all(colnames(cts) == rownames(coldata))
[1] TRUE
$strain %<>% (function(x) sub("/",".",x))
coldata$strain %<>% factor coldata
We use DESeq2 to produce a normalized count matrix, which is the input to SVA. Note that the biological groups (mouse strain) were present in all batches, so a block design.
library(DESeq2)
<- DESeqDataSetFromMatrix(cts, coldata, ~strain)
dds $batch <- factor(dds$experiment.number)
ddstable(dds$strain, dds$batch)
4 6 7
C57BL.6J 3 4 3
DBA.2J 4 3 4
We estimate the library size correction and save the normalized counts matrix:
<- estimateSizeFactors(dds)
dds <- counts(dds, normalized=TRUE) norm.cts
Finally, we provide the normalized counts and two model matrices to SVA. The first matrix uses the biological condition, and the second model matrix is the null model, here just an intercept term.
library(sva)
<- model.matrix(~ strain, colData(dds))
mm <- model.matrix(~ 1, colData(dds))
mm0 <- norm.cts[rowSums(norm.cts) > 0,]
norm.cts <- svaseq(norm.cts, mod=mm, mod0=mm0, n.sv=2) fit
Number of significant surrogate variables is: 2
Iteration (out of 5 ):1 2 3 4 5
If we plot the estimated surrogate variables, we see they separate the samples from different batches fairly well. This procedure can be used to estimate unknown, or undocumented batches present in high-throughput datasets, although we must be careful not to overfit and remove too much of the variation in the data – some of which represents natural biological variations of samples.
library(rafalib)
bigpar()
$strain.int <- as.integer(dds$strain) + 15
ddsplot(fit$sv[,1:2], col=dds$batch, pch=dds$strain.int, cex=2,
xlab="SV1", ylab="SV2")
legend("top", levels(dds$batch), pch=16,
col=1:3, cex=.8, ncol=3, title="batch")
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] rafalib_1.0.0 sva_3.52.0 BiocParallel_1.38.0
[4] genefilter_1.86.0 mgcv_1.9-1 nlme_3.1-165
[7] DESeq2_1.44.0 SummarizedExperiment_1.34.0 Biobase_2.64.0
[10] MatrixGenerics_1.16.0 matrixStats_1.3.0 GenomicRanges_1.56.1
[13] GenomeInfoDb_1.40.1 IRanges_2.38.1 S4Vectors_0.42.1
[16] BiocGenerics_0.50.0 magrittr_2.0.3 testthat_3.2.1.1
[19] rmarkdown_2.27 devtools_2.4.5 usethis_3.0.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 remotes_2.5.0 rlang_1.1.4 compiler_4.4.1
[5] RSQLite_2.3.7 png_0.1-8 vctrs_0.6.5 stringr_1.5.1
[9] profvis_0.3.8 pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
[13] XVector_0.44.0 ellipsis_0.3.2 utf8_1.2.4 promises_1.3.0
[17] sessioninfo_1.2.2 UCSC.utils_1.0.0 purrr_1.0.2 bit_4.0.5
[21] xfun_0.46 zlibbioc_1.50.0 cachem_1.1.0 jsonlite_1.8.8
[25] blob_1.2.4 later_1.3.2 DelayedArray_0.30.1 parallel_4.4.1
[29] R6_2.5.1 RColorBrewer_1.1-3 stringi_1.8.4 limma_3.60.4
[33] pkgload_1.4.0 brio_1.1.5 Rcpp_1.0.13 knitr_1.48
[37] httpuv_1.6.15 Matrix_1.7-0 splines_4.4.1 tidyselect_1.2.1
[41] rstudioapi_0.16.0 abind_1.4-5 yaml_2.3.10 codetools_0.2-20
[45] miniUI_0.1.1.1 pkgbuild_1.4.4 lattice_0.22-6 tibble_3.2.1
[49] shiny_1.9.1 KEGGREST_1.44.1 evaluate_0.24.0 survival_3.7-0
[53] urlchecker_1.0.1 Biostrings_2.72.1 pillar_1.9.0 generics_0.1.3
[57] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0 xtable_1.8-4
[61] glue_1.7.0 tools_4.4.1 annotate_1.82.0 locfit_1.5-9.10
[65] fs_1.6.4 XML_3.99-0.17 grid_4.4.1 edgeR_4.2.1
[69] AnnotationDbi_1.66.0 colorspace_2.1-1 GenomeInfoDbData_1.2.12 cli_3.6.3
[73] fansi_1.0.6 S4Arrays_1.4.1 dplyr_1.1.4 gtable_0.3.5
[77] digest_0.6.36 SparseArray_1.4.8 htmlwidgets_1.6.4 memoise_2.0.1
[81] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7 statmod_1.5.0
[85] mime_0.12 bit64_4.0.5