We showed in the section on batch effects, that we can sometimes identify the source of batch effects, and by using statistical models, we can remove any sample-specific variation we can predict based on features like sequence content or gene length. Here we will show a powerful procedure, which doesn’t require the use of knowing exactly how the counts will vary across batches – it uses only the biological condition, and looks for large scale variation which is orthogonal to the biological condition. This approach (we will examine a specific method called surrogate variable analysis, or SVA), does require that the technical variation be orthogonal to the biological conditions.

For the description of the SVA method, see these two papers:

Other methods which can identify hidden batches are described in these papers:

We start by downloading RNA-seq count tables from the original recount project:

url <- "http://bowtie-bio.sourceforge.net/recount/countTables/bottomly_count_table.txt"
file <- "bottomly_count_table.txt"
if (!file.exists(file)) download.file(url, file)
url <- "http://bowtie-bio.sourceforge.net/recount/phenotypeTables/bottomly_phenodata.txt"
file <- "bottomly_phenodata.txt"
if (!file.exists(file)) download.file(url, file)

This experiment consisted of 10 and 11 mice of two different strains. But useful for our demonstration of SVA, it is key that the experiment was performed in three batches. We will pretend in the following code that we do not know the batches, although we do. We will estimate technical variation using only the information about which mouse is of which strain.

library(magrittr)
cts <- read.table("bottomly_count_table.txt", header=TRUE, row.names=1)
coldata <- read.table("bottomly_phenodata.txt", header=TRUE, row.names=1)
all(colnames(cts) == rownames(coldata))
## [1] TRUE
coldata$strain %<>% (function(x) sub("/",".",x))
coldata$strain %<>% factor

We use DESeq2 to produce a normalized count matrix, which is the input to SVA. Note that the biological groups (mouse strain) were present in all batches, so a block design.

library(DESeq2)
dds <- DESeqDataSetFromMatrix(cts, coldata, ~strain)
dds$batch <- factor(dds$experiment.number)
table(dds$strain, dds$batch)
##           
##            4 6 7
##   C57BL.6J 3 4 3
##   DBA.2J   4 3 4

We estimate the library size correction and save the normalized counts matrix:

dds <- estimateSizeFactors(dds)
norm.cts <- counts(dds, normalized=TRUE)

Finally, we provide the normalized counts and two model matrices to SVA. The first matrix uses the biological condition, and the second model matrix is the null model, here just an intercept term.

library(sva)
mm <- model.matrix(~ strain, colData(dds))
mm0 <- model.matrix(~ 1, colData(dds))
norm.cts <- norm.cts[rowSums(norm.cts) > 0,]
fit <- svaseq(norm.cts, mod=mm, mod0=mm0, n.sv=2)
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

If we plot the estimated surrogate variables, we see they separate the samples from different batches fairly well. This procedure can be used to estimate unknown, or undocumented batches present in high-throughput datasets, although we must be careful not to overfit and remove too much of the variation in the data – some of which represents natural biological variations of samples.

library(rafalib)
bigpar()
dds$strain.int <- as.integer(dds$strain) + 15
plot(fit$sv[,1:2], col=dds$batch, pch=dds$strain.int, cex=2,
     xlab="SV1", ylab="SV2")
legend("top", levels(dds$batch), pch=16,
       col=1:3, cex=.8, ncol=3, title="batch")

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] splines   stats4    parallel  stats     graphics  grDevices datasets  utils     methods  
## [10] base     
## 
## other attached packages:
##  [1] sva_3.38.0                        BiocParallel_1.24.1              
##  [3] genefilter_1.72.0                 mgcv_1.8-33                      
##  [5] nlme_3.1-151                      magrittr_2.0.1                   
##  [7] cqn_1.36.0                        quantreg_5.75                    
##  [9] SparseM_1.78                      preprocessCore_1.52.0            
## [11] nor1mix_1.3-0                     mclust_5.4.7                     
## [13] rafalib_1.0.0                     DESeq2_1.30.0                    
## [15] SummarizedExperiment_1.20.0       MatrixGenerics_1.2.0             
## [17] matrixStats_0.57.0                GenomicFeatures_1.42.1           
## [19] AnnotationDbi_1.52.0              Biobase_2.50.0                   
## [21] BSgenome.Hsapiens.UCSC.hg38_1.4.3 BSgenome_1.58.0                  
## [23] rtracklayer_1.50.0                Biostrings_2.58.0                
## [25] XVector_0.30.0                    GenomicRanges_1.42.0             
## [27] GenomeInfoDb_1.26.2               IRanges_2.24.1                   
## [29] S4Vectors_0.28.1                  BiocGenerics_0.36.0              
## [31] testthat_3.0.1                    rmarkdown_2.6                    
## [33] devtools_2.3.2                    usethis_2.0.0                    
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-0         ellipsis_0.3.1           rprojroot_2.0.2         
##  [4] fs_1.5.0                 MatrixModels_0.4-1       remotes_2.2.0           
##  [7] bit64_4.0.5              fansi_0.4.1              xml2_1.3.2              
## [10] codetools_0.2-18         geneplotter_1.68.0       knitr_1.30              
## [13] pkgload_1.1.0            Rsamtools_2.6.0          annotate_1.68.0         
## [16] dbplyr_2.0.0             BiocManager_1.30.10      compiler_4.0.3          
## [19] httr_1.4.2               assertthat_0.2.1         Matrix_1.3-0            
## [22] limma_3.46.0             cli_2.2.0                htmltools_0.5.0         
## [25] prettyunits_1.1.1        tools_4.0.3              gtable_0.3.0            
## [28] glue_1.4.2               GenomeInfoDbData_1.2.4   dplyr_1.0.2             
## [31] rappdirs_0.3.1           Rcpp_1.0.5               vctrs_0.3.6             
## [34] conquer_1.0.2            xfun_0.19                stringr_1.4.0           
## [37] ps_1.5.0                 lifecycle_0.2.0          XML_3.99-0.5            
## [40] edgeR_3.32.0             zlibbioc_1.36.0          scales_1.1.1            
## [43] hms_0.5.3                RColorBrewer_1.1-2       yaml_2.2.1              
## [46] curl_4.3                 memoise_1.1.0            ggplot2_3.3.3           
## [49] biomaRt_2.46.2           stringi_1.5.3            RSQLite_2.2.1           
## [52] desc_1.2.0               pkgbuild_1.2.0           rlang_0.4.9             
## [55] pkgconfig_2.0.3          bitops_1.0-6             evaluate_0.14           
## [58] lattice_0.20-41          purrr_0.3.4              GenomicAlignments_1.26.0
## [61] bit_4.0.4                processx_3.4.5           tidyselect_1.1.0        
## [64] R6_2.5.0                 generics_0.1.0           DelayedArray_0.16.0     
## [67] DBI_1.1.0                pillar_1.4.7             withr_2.3.0             
## [70] survival_3.2-7           RCurl_1.98-1.2           tibble_3.0.4            
## [73] crayon_1.3.4             BiocFileCache_1.14.0     progress_1.2.2          
## [76] locfit_1.5-9.4           grid_4.0.3               blob_1.2.1              
## [79] callr_3.5.1              digest_0.6.27            xtable_1.8-4            
## [82] openssl_1.4.3            munsell_0.5.0            sessioninfo_1.1.1       
## [85] askpass_1.1