Batch effects and GC content

Previously, we looked at distances between samples in high-throughput sequencing experiments, exploring sequencing depth as a technical artifact, the effect of various transformations on stabilizing variance, and PCA and hierarchical clustering as methods for ordination of samples.

In distances, we noticed many genes whose measurements differ across sequencing centers, and in hclust, we saw high-level clustering of samples by sequencing center when we focused on a single human population. These technical differences in high-throughput measurements are often referred to as “batch effects”, a term which encompasses any kind of technical artifact due to the lab or time at which a sample was processed. Note that the time of sample preparation, the batch, can introduce its own technical distortion in the measurements, just as processing at a different sequencing center can.

Two references which give a general outline of batch effects in sequencing data are:

Tackling the widespread and critical impact of batch effects in high-throughput data link
Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries link

It is critical, if you are involved in the design of a high-throughput experiment, that the biological conditions are not confounded with the sample preparation batches. Instead, use either a block or a randomized design. Block designs are easy: simply ensure that each sample preparation batch contains each of the biological conditions of interest, so that the batch effects can be separated from the biological differences between samples.
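
As a small illustration of the difference between a confounded and a block design, the sketch below cross-tabulates condition against batch for a hypothetical experiment with 8 samples. The sample layout and the names `confounded` and `blocked` are assumptions made for this example and are not part of the GEUVADIS data.

```r
# Hypothetical experiment: 8 samples, 2 conditions, 2 preparation batches
condition <- factor(rep(c("control", "treated"), each = 4))

# Confounded design: each batch contains only one condition
confounded <- factor(rep(c("batch1", "batch2"), each = 4))

# Block design: each batch contains both conditions in equal numbers
blocked <- factor(rep(c("batch1", "batch2"), times = 4))

# Zeros in a table mean batch and condition cannot be separated
tab_confounded <- table(condition, confounded)
tab_blocked <- table(condition, blocked)
print(tab_confounded)
print(tab_blocked)

# The design ~ batch + condition is full rank (3 columns) only for the block design
rank_confounded <- qr(model.matrix(~ confounded + condition))$rank
rank_blocked <- qr(model.matrix(~ blocked + condition))$rank
c(confounded = rank_confounded, blocked = rank_blocked)
```

In the confounded layout the batch and condition columns of the design matrix are identical, so no model can tell the two effects apart.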

Here I will show that the origin of batch effects across labs can sometimes be isolated to batch-specific distortions related to the DNA sequence of the genes. As described in the second link above, a step in nearly all high-throughput experiments is to amplify the DNA fragments using PCR. As PCR is an exponential process of copying molecules, it is very sensitive to slight variations, and this results in distortions of measurements which are specific to the particular place and time when the samples were processed: the batch.
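
To give a feeling for how sensitive an exponential process is, here is a minimal sketch that assumes a constant per-cycle copying efficiency. The function `pcr_yield`, the number of cycles and the two efficiencies are all hypothetical values chosen for illustration, not measurements from any protocol.

```r
# Copies per starting molecule after a number of PCR cycles, assuming a
# constant per-cycle efficiency (0 = no copying, 1 = perfect doubling)
pcr_yield <- function(efficiency, cycles) (1 + efficiency)^cycles

cycles <- 15                                   # hypothetical number of cycles
eff <- c(fragmentA = 0.95, fragmentB = 0.90)   # hypothetical efficiencies

yield <- pcr_yield(eff, cycles)

# Fold difference between the two fragments introduced by amplification alone
ratio <- unname(yield["fragmentA"] / yield["fragmentB"])
ratio
```

Under these assumptions, a difference of 5 percentage points in efficiency compounds to a difference of close to 1.5-fold in the final number of copies.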

GEUVADIS dataset

We start again with the GEUVADIS RNA-seq samples prepared in distances.

library(DESeq2)
load("geuvadis.rda")

As is typical, though we conceptually have a simple task to perform, many steps are needed just to get the covariates that we need. Here our simple task is to model the counts on covariates like the genes’ lengths and sequence content. However, it will take many steps to obtain and summarize the sequence content into a single number.

A critical summary of DNA sequence is the GC content, which is the fraction of G’s and C’s out of the total sequence. Note that, because G pairs with C, and C with G, the GC content of a sequence and its complement is identical. GC content varies along the genome, for biological reasons. Different organisms have very different GC content of their genomes. But we care about GC content also for a technical reason: the GC content of a piece of DNA can make it difficult to amplify. We will see this below.
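
The definition can be checked on a toy example in base R. The helper functions `gc_content` and `complement` and the sequence `x` below are made up for this illustration; later in this document the same quantity is computed for real sequences with letterFrequency.

```r
# GC content: fraction of G and C among all bases of a DNA string
gc_content <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Complement of a DNA string (A <-> T, C <-> G)
complement <- function(seq) chartr("ACGT", "TGCA", toupper(seq))

x <- "ATGCGCGTTA"            # made-up sequence
gc_content(x)                # GC content of the sequence
gc_content(complement(x))    # GC content of its complement is the same
```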

In order to get the GC content of each gene, we start by downloading the information about the genes used by recount2 for quantification, which was version 25 of the Gencode genes. We download this file from the following link (37 Mb).

library(GenomicFeatures)

We need to import this gene information from a GTF (gene transfer format) file, and turn it into a TxDb. We then save the database, so we can skip the makeTxDbFromGFF step in the future. Note that two warning messages are expected and can be ignored: one stating that The "phase" metadata column contains non-NA values, and another about closing a connection.

txdb <- makeTxDbFromGFF("gencode.v25.annotation.gtf.gz")
saveDb(txdb, file="gencode.sqlite")

We load the TxDb and extract all the exons, grouped by gene. We check that we have the same gene names in ebg and in dds.

txdb <- loadDb("gencode.sqlite")
ebg <- exonsBy(txdb, by="gene")
head(names(ebg))

[1] "ENSG00000000003.14" "ENSG00000000005.5"  "ENSG00000000419.12" "ENSG00000000457.13"
[5] "ENSG00000000460.16" "ENSG00000000938.12"

table(names(ebg) %in% rownames(dds))


 TRUE 
58037

table(rownames(dds) %in% names(ebg))


 TRUE 
58037

Note that the exons in ebg contain redundant sequence. We can see this by plotting the ranges for a given gene. Note that, after running reduce, we remove all redundant sequence in ebg for a given gene.

e <- ebg[[1]]
library(rafalib)
plotRanges <- function(e) {
  l <- length(e)
  r <- ranges(range(e))
  nullplot(start(r), end(r), 0, l+1)
  segments(start(e), 1:l, end(e), 1:l, lwd=5)
}
plotRanges(e)

plotRanges(reduce(e))
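
To make the idea of removing redundant sequence concrete, here is a base R sketch that merges overlapping or adjacent intervals, which is the effect we rely on when calling reduce. The function `merge_intervals` and the three toy exons are assumptions for illustration; this is not how the Bioconductor function is implemented, and it ignores strand and chromosome.

```r
# Merge overlapping or adjacent integer intervals [start, end]
merge_intervals <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  # Largest end seen among the preceding intervals
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  # A new merged interval begins only when there is a gap of at least one base
  group <- cumsum(start > prev_max_end + 1)
  data.frame(start = as.vector(tapply(start, group, min)),
             end = as.vector(tapply(end, group, max)))
}

# Two overlapping toy exons plus one distant exon
ex <- data.frame(start = c(100, 150, 400), end = c(200, 250, 500))
redundant_width <- sum(ex$end - ex$start + 1)   # overlap counted twice

merged <- merge_intervals(ex$start, ex$end)
merged
reduced_width <- sum(merged$end - merged$start + 1)
c(redundant = redundant_width, reduced = reduced_width)
```

The difference between the two widths is the sequence that would be counted more than once if we summed the exon widths without merging first.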

Now we put the reduced exons in correct order (in this case they are already in correct order), and we store them as rowRanges of the dataset.

exons <- reduce(ebg)
exons <- exons[ rownames(dds) ]
rowRanges(dds) <- exons

Calculate GC content and length of reduced exons

Now we extract the exonic sequence for every gene, using the extractTranscriptSeqs function.

# package is ~700 Mb
library(BSgenome.Hsapiens.UCSC.hg38)
dna <- extractTranscriptSeqs(Hsapiens, rowRanges(dds))

We then calculate the GC content (ratio of G or C to total basepairs) with letterFrequency, and save this as a metadata column gc. We also save the total number of basepairs to a metadata column len.

mcols(dds)$gc <- as.numeric(letterFrequency(dna, "GC", as.prob=TRUE))
mcols(dds)$len <- sum(width(rowRanges(dds)))

with(mcols(dds), hist(gc))

with(mcols(dds), hist(log10(len)))

We now have all the covariates we need for modeling how the counts vary by GC content. We can make simple plots to check for a dependence. Note that, outside the GC content range of .35-.65, we see very few large counts, although there are genes with GC content outside this range. It is very difficult to amplify the cDNA fragments from these genes, and so they are often underrepresented in, or missing from, high-throughput sequencing experiments like RNA-seq.

plot(mcols(dds)$gc, log10(counts(dds)[,1]+1), cex=.1)
abline(v=c(.35,.65))

We also plot the counts over the length of the gene. It is expected that, everything else being equal, we should see higher counts from longer genes. Keep in mind, though, that we do not expect a line in this scatterplot, due to differences in expression among the genes of a given length. In other words, at any point on the x-axis, the differences within a vertical band can be explained by gene expression (as well as by any other technical covariates, like GC content).

plot(log10(mcols(dds)$len), log10(counts(dds)[,1]+1), cex=.1)
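
A small simulation can show why we expect a trend but not a line. In the sketch below, counts are drawn with a mean proportional to length times expression; the distributions and all parameter values are arbitrary assumptions chosen only to produce a plausible-looking cloud of points.

```r
set.seed(1)
n <- 2000

# Made-up gene lengths (bp) and expression levels (expected reads per bp)
len_sim <- round(10^runif(n, 2.5, 4.5))
expr_sim <- rlnorm(n, meanlog = -4, sdlog = 2)

# Counts with mean proportional to length x expression
counts_sim <- rpois(n, lambda = len_sim * expr_sim)

# Positive but far from perfect correlation on the log scale
r <- cor(log10(len_sim), log10(counts_sim + 1))
r
```

Plotting `log10(counts_sim + 1)` against `log10(len_sim)` gives an upward-sloping cloud, where the vertical spread at a given length reflects the simulated expression differences.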

Conditional Quantile Normalization

We will now use a Bioconductor package called cqn to model systematic dependence of counts on gene GC content and length. We provide the cqn function with the counts, the GC content and the gene lengths.

You can ignore the warning stating that the use of 'sig2' is deprecated.

library(cqn)
idx <- dds$population == "TSI"
dds2 <- dds[,idx]
cts <- counts(dds2)
fit <- cqn(cts, mcols(dds2)$gc, mcols(dds2)$len)

Warning in norMix(mu = mix.param$mean, sig2 = mix.param$variance$sigmasq, : The use of 'sig2' is
deprecated; do specify 'sigma' (= sqrt(sig2)) instead

The plots show estimated spline dependence of counts on GC (n=1)…

cqnplot(fit, n=1)

…and dependence on length (n=2).

cqnplot(fit, n=2, xlim=c(-2,4.5))

Both of these plots are typical. For GC content, we see an upside-down “U”, where the low and high GC content fragments are systematically underrepresented (although we see a lot of sample-to-sample variability in the estimated splines). It is also typical to have more counts for longer genes, because fragmentation produces more fragments from longer transcripts. The tick marks on the x-axis indicate the knots of the splines; by default, these are placed at the data quantiles 0.025, 0.25, 0.5, 0.75, and 0.975.
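
To illustrate what a spline with knots at data quantiles looks like, the following sketch fits a natural cubic spline to simulated data with an upside-down “U” dependence on GC content. This is only an assumption-laden stand-in: it uses ordinary least squares via lm and simulated values, whereas the fit in cqn is based on quantile regression and includes further steps that are not reproduced here.

```r
library(splines)

set.seed(1)
n <- 5000
gc_sim <- rbeta(n, 20, 20)   # simulated GC content centered on 0.5
# Simulated log2 counts: inverted-U dependence on GC plus noise
y_sim <- 8 - 60 * (gc_sim - 0.5)^2 + rnorm(n, sd = 1.5)
d <- data.frame(gc = gc_sim, y = y_sim)

# Knots at the data quantiles mentioned in the text
q <- unname(quantile(d$gc, c(0.025, 0.25, 0.5, 0.75, 0.975)))

# Three interior knots, with the outer two quantiles as boundary knots
fit_ls <- lm(y ~ ns(gc, knots = q[2:4], Boundary.knots = q[c(1, 5)]), data = d)

# Evaluate the fitted curve on a grid between the boundary knots
grid <- data.frame(gc = seq(q[1], q[5], length.out = 50))
pred <- predict(fit_ls, newdata = grid)

# GC content at which the fitted curve peaks
peak_gc <- grid$gc[which.max(pred)]
peak_gc
```

Overlaying `lines(grid$gc, pred)` on a scatterplot of the simulated data gives a curve of the same general shape as the ones shown by cqnplot for GC content.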

We can draw the lines with the sequencing center as the color. Doing this, we see that the lines cluster by sequencing center. The dependence of counts on the GC content of the features being sequenced is highly batch specific.

library(rafalib)
bigpar()
cqnplot(fit, n=1, col=dds2$center)
legend("bottom", levels(dds2$center), fill=1:nlevels(dds2$center))

The dependence of counts on gene length varies less across centers:

bigpar()
cqnplot(fit, n=2, col=dds2$center, xlim=c(-2,4.5))
legend("topleft", levels(dds2$center), fill=1:nlevels(dds2$center))

Finally, we zoom in on two sequencing centers, to show how different the dependence of counts on GC content can be across batches. You can also cross-reference this final plot with Figure 2 here, which goes into more depth on the topic.

idx <- dds$population == "TSI" & dds$center %in% c("CGR","UG")
dds3 <- dds[,idx]
cts <- counts(dds3)
fit <- cqn(cts, mcols(dds3)$gc, mcols(dds3)$len)

Warning in norMix(mu = mix.param$mean, sig2 = mix.param$variance$sigmasq, : The use of 'sig2' is
deprecated; do specify 'sigma' (= sqrt(sig2)) instead

dds3$center <- droplevels(dds3$center)
bigpar()
cqnplot(fit, n=1, col=dds3$center)
legend("bottom", levels(dds3$center), fill=1:nlevels(dds3$center))

Downstream use of CQN

Above we showed plots of the bias in the counts of sequenced reads as a function of properties of the features (genes), including sequence composition (GC content) and the number of basepairs in the reduced exon ranges. The cqn output is also valuable for plotting corrected data, and for providing offsets for use in downstream statistical analysis.

names(fit)

 [1] "counts"      "lengths"     "sizeFactors" "subindex"    "y"           "x"          
 [7] "offset"      "offset0"     "glm.offset"  "func1"       "func2"       "grid1"      
[13] "grid2"       "knots1"      "knots2"      "call"

A natural log scale offset is provided in glm.offset. This can be converted back to count scale (similar to a size factor but across all genes x samples):

cqnOffset <- fit$glm.offset
cqnNormFactors <- exp(cqnOffset)
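
One convention for using such a matrix as gene-by-sample normalization factors is to center each row so that its geometric mean is 1, which keeps the normalized counts on roughly the scale of the original counts. The sketch below shows the centering on a toy offset matrix with made-up numbers; whether this step is appropriate depends on the downstream method, so treat it as an assumption rather than part of the analysis above.

```r
# Toy natural-log offsets for 3 genes x 4 samples (made-up numbers)
offset_toy <- matrix(c( 0.2,  0.5, -0.1,  0.4,
                        1.0,  1.3,  0.9,  1.2,
                       -0.5, -0.2, -0.6, -0.3),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:3),
                                     paste0("sample", 1:4)))

# Back to the count scale, as done for cqnNormFactors
norm_toy <- exp(offset_toy)

# Divide each row by its geometric mean
centered <- norm_toy / exp(rowMeans(log(norm_toy)))

# Row-wise geometric means are now 1, and ratios between samples are unchanged
row_gm <- exp(rowMeans(log(centered)))
row_gm
```

With DESeq2, a centered matrix of this kind, with the same dimensions as the count matrix, can be supplied through `normalizationFactors<-` in place of size factors.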

The cqn output also has count values that are corrected for differential bias across samples. We can compare before…

bigpar()
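# keep genes with more than 10 total counts and more than 1000 bp of exonic sequence
filter <- as.integer(rowSums(counts(dds3)) > 10 & mcols(dds3)$len > 1000)
# indices of the 10 retained genes with the highest GC content
head(order(mcols(dds3)$gc * filter,
           decreasing=TRUE), 10)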

 [1] 15820 32706  5674 53583 16699 15239 14765 57124 57134 57177

idx <- 32706
mcols(dds)[idx,]

DataFrame with 1 row and 2 columns
                         gc       len
                  <numeric> <integer>
ENSG00000234965.2  0.743772      1405

boxplot(counts(dds3, normalized=TRUE)[idx,] ~ dds3$center,
        ylim=c(0,300), col=1:2,
        xlab="center", ylab="scaled counts")

to after:

bigpar()
exprs <- fit$y + fit$offset # note: atypical definition of offset...
boxplot(exprs[idx,] ~ dds3$center, col=1:2,
        xlab="center", ylab="log2 scale expression")

While not all of the high GC features are exactly corrected, we see that for a number of them, the differences have been accounted for by modeling on technical covariates.

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] splines   stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] cqn_1.50.0                        quantreg_5.98                    
 [3] SparseM_1.84-2                    preprocessCore_1.66.0            
 [5] nor1mix_1.3-3                     mclust_6.1.1                     
 [7] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.72.0                  
 [9] rtracklayer_1.64.0                BiocIO_1.14.0                    
[11] Biostrings_2.72.1                 XVector_0.44.0                   
[13] rafalib_1.0.0                     GenomicFeatures_1.56.0           
[15] AnnotationDbi_1.66.0              DESeq2_1.44.0                    
[17] SummarizedExperiment_1.34.0       Biobase_2.64.0                   
[19] MatrixGenerics_1.16.0             matrixStats_1.3.0                
[21] GenomicRanges_1.56.1              GenomeInfoDb_1.40.1              
[23] IRanges_2.38.1                    S4Vectors_0.42.1                 
[25] BiocGenerics_0.50.0               testthat_3.2.1.1                 
[27] rmarkdown_2.27                    devtools_2.4.5                   
[29] usethis_3.0.0                    

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                bitops_1.0-8             remotes_2.5.0           
 [4] rlang_1.1.4              magrittr_2.0.3           compiler_4.4.1          
 [7] RSQLite_2.3.7            png_0.1-8                vctrs_0.6.5             
[10] stringr_1.5.1            profvis_0.3.8            pkgconfig_2.0.3         
[13] crayon_1.5.3             fastmap_1.2.0            ellipsis_0.3.2          
[16] utf8_1.2.4               Rsamtools_2.20.0         promises_1.3.0          
[19] sessioninfo_1.2.2        UCSC.utils_1.0.0         MatrixModels_0.5-3      
[22] purrr_1.0.2              bit_4.0.5                xfun_0.46               
[25] zlibbioc_1.50.0          cachem_1.1.0             jsonlite_1.8.8          
[28] blob_1.2.4               later_1.3.2              DelayedArray_0.30.1     
[31] BiocParallel_1.38.0      parallel_4.4.1           R6_2.5.1                
[34] RColorBrewer_1.1-3       stringi_1.8.4            pkgload_1.4.0           
[37] brio_1.1.5               Rcpp_1.0.13              knitr_1.48              
[40] httpuv_1.6.15            Matrix_1.7-0             tidyselect_1.2.1        
[43] rstudioapi_0.16.0        abind_1.4-5              yaml_2.3.10             
[46] codetools_0.2-20         miniUI_0.1.1.1           curl_5.2.1              
[49] pkgbuild_1.4.4           lattice_0.22-6           tibble_3.2.1            
[52] shiny_1.9.1              KEGGREST_1.44.1          evaluate_0.24.0         
[55] survival_3.7-0           urlchecker_1.0.1         pillar_1.9.0            
[58] generics_0.1.3           RCurl_1.98-1.16          ggplot2_3.5.1           
[61] munsell_0.5.1            scales_1.3.0             xtable_1.8-4            
[64] glue_1.7.0               tools_4.4.1              GenomicAlignments_1.40.0
[67] locfit_1.5-9.10          fs_1.6.4                 XML_3.99-0.17           
[70] grid_4.4.1               colorspace_2.1-1         GenomeInfoDbData_1.2.12 
[73] restfulr_0.0.15          cli_3.6.3                fansi_1.0.6             
[76] S4Arrays_1.4.1           dplyr_1.1.4              gtable_0.3.5            
[79] digest_0.6.36            SparseArray_1.4.8        rjson_0.2.21            
[82] htmlwidgets_1.6.4        memoise_2.0.1            htmltools_0.5.8.1       
[85] lifecycle_1.0.4          httr_1.4.7               mime_0.12               
[88] MASS_7.3-61              bit64_4.0.5