Surrogate variable analysis: hidden batch effects

Author

We showed in the section on batch effects, that we can sometimes identify the source of batch effects, and by using statistical models, we can remove any sample-specific variation we can predict based on features like sequence content or gene length. Here we will show a powerful procedure, which doesn’t require the use of knowing exactly how the counts will vary across batches – it uses only the biological condition, and looks for large scale variation which is orthogonal to the biological condition. This approach (we will examine a specific method called surrogate variable analysis, or SVA), does require that the technical variation be orthogonal to the biological conditions.

For the description of the SVA method, see these two papers:

Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis link
svaseq: removing batch effects and other unwanted noise from sequencing data link

Other methods which can identify hidden batches are described in these papers:

Normalization of RNA-seq data using factor analysis of control genes or samples link
Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses link

We start by downloading RNA-seq count tables from the original recount project:

url <- "http://bowtie-bio.sourceforge.net/recount/countTables/bottomly_count_table.txt"
file <- "bottomly_count_table.txt"
if (!file.exists(file)) download.file(url, file)
url <- "http://bowtie-bio.sourceforge.net/recount/phenotypeTables/bottomly_phenodata.txt"
file <- "bottomly_phenodata.txt"
if (!file.exists(file)) download.file(url, file)

This experiment consisted of 10 and 11 mice of two different strains. But useful for our demonstration of SVA, it is key that the experiment was performed in three batches. We will pretend in the following code that we do not know the batches, although we do. We will estimate technical variation using only the information about which mouse is of which strain.

library(magrittr)
cts <- read.table("bottomly_count_table.txt", header=TRUE, row.names=1)
coldata <- read.table("bottomly_phenodata.txt", header=TRUE, row.names=1)
all(colnames(cts) == rownames(coldata))

[1] TRUE

coldata$strain %<>% (function(x) sub("/",".",x))
coldata$strain %<>% factor

We use DESeq2 to produce a normalized count matrix, which is the input to SVA. Note that the biological groups (mouse strain) were present in all batches, so a block design.

library(DESeq2)
dds <- DESeqDataSetFromMatrix(cts, coldata, ~strain)
dds$batch <- factor(dds$experiment.number)
table(dds$strain, dds$batch)

          
           4 6 7
  C57BL.6J 3 4 3
  DBA.2J   4 3 4

We estimate the library size correction and save the normalized counts matrix:

dds <- estimateSizeFactors(dds)
norm.cts <- counts(dds, normalized=TRUE)

Finally, we provide the normalized counts and two model matrices to SVA. The first matrix uses the biological condition, and the second model matrix is the null model, here just an intercept term.

library(sva)
mm <- model.matrix(~ strain, colData(dds))
mm0 <- model.matrix(~ 1, colData(dds))
norm.cts <- norm.cts[rowSums(norm.cts) > 0,]
fit <- svaseq(norm.cts, mod=mm, mod0=mm0, n.sv=2)

Number of significant surrogate variables is:  2 
Iteration (out of 5 ):1  2  3  4  5

If we plot the estimated surrogate variables, we see they separate the samples from different batches fairly well. This procedure can be used to estimate unknown, or undocumented batches present in high-throughput datasets, although we must be careful not to overfit and remove too much of the variation in the data – some of which represents natural biological variations of samples.

library(rafalib)
bigpar()
dds$strain.int <- as.integer(dds$strain) + 15
plot(fit$sv[,1:2], col=dds$batch, pch=dds$strain.int, cex=2,
     xlab="SV1", ylab="SV2")
legend("top", levels(dds$batch), pch=16,
       col=1:3, cex=.8, ncol=3, title="batch")

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] rafalib_1.0.0               sva_3.52.0                  BiocParallel_1.38.0        
 [4] genefilter_1.86.0           mgcv_1.9-1                  nlme_3.1-165               
 [7] DESeq2_1.44.0               SummarizedExperiment_1.34.0 Biobase_2.64.0             
[10] MatrixGenerics_1.16.0       matrixStats_1.3.0           GenomicRanges_1.56.1       
[13] GenomeInfoDb_1.40.1         IRanges_2.38.1              S4Vectors_0.42.1           
[16] BiocGenerics_0.50.0         magrittr_2.0.3              testthat_3.2.1.1           
[19] rmarkdown_2.27              devtools_2.4.5              usethis_3.0.0              

loaded via a namespace (and not attached):
 [1] DBI_1.2.3               remotes_2.5.0           rlang_1.1.4             compiler_4.4.1         
 [5] RSQLite_2.3.7           png_0.1-8               vctrs_0.6.5             stringr_1.5.1          
 [9] profvis_0.3.8           pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
[13] XVector_0.44.0          ellipsis_0.3.2          utf8_1.2.4              promises_1.3.0         
[17] sessioninfo_1.2.2       UCSC.utils_1.0.0        purrr_1.0.2             bit_4.0.5              
[21] xfun_0.46               zlibbioc_1.50.0         cachem_1.1.0            jsonlite_1.8.8         
[25] blob_1.2.4              later_1.3.2             DelayedArray_0.30.1     parallel_4.4.1         
[29] R6_2.5.1                RColorBrewer_1.1-3      stringi_1.8.4           limma_3.60.4           
[33] pkgload_1.4.0           brio_1.1.5              Rcpp_1.0.13             knitr_1.48             
[37] httpuv_1.6.15           Matrix_1.7-0            splines_4.4.1           tidyselect_1.2.1       
[41] rstudioapi_0.16.0       abind_1.4-5             yaml_2.3.10             codetools_0.2-20       
[45] miniUI_0.1.1.1          pkgbuild_1.4.4          lattice_0.22-6          tibble_3.2.1           
[49] shiny_1.9.1             KEGGREST_1.44.1         evaluate_0.24.0         survival_3.7-0         
[53] urlchecker_1.0.1        Biostrings_2.72.1       pillar_1.9.0            generics_0.1.3         
[57] ggplot2_3.5.1           munsell_0.5.1           scales_1.3.0            xtable_1.8-4           
[61] glue_1.7.0              tools_4.4.1             annotate_1.82.0         locfit_1.5-9.10        
[65] fs_1.6.4                XML_3.99-0.17           grid_4.4.1              edgeR_4.2.1            
[69] AnnotationDbi_1.66.0    colorspace_2.1-1        GenomeInfoDbData_1.2.12 cli_3.6.3              
[73] fansi_1.0.6             S4Arrays_1.4.1          dplyr_1.1.4             gtable_0.3.5           
[77] digest_0.6.36           SparseArray_1.4.8       htmlwidgets_1.6.4       memoise_2.0.1          
[81] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7              statmod_1.5.0          
[85] mime_0.12               bit64_4.0.5