In this document, we will cover how considering certain datasets in the context of networks can help us understand statistical properties of the data on a more biologically meaningful level. We will use as an example dataset the list of DNA loops (chromatin interaction data) found using deeply sequenced in situ Hi-C. The published article is:

Static and Dynamic DNA Loops form AP-1-Bound Activation Hubs during Macrophage Development by Douglas Phanstiel

A preprint of the article can also be found here.

The article describes in detail the set of DNA loops which are gained and lost as a precursor monocyte differentiates into a mature macrophage. The summary of the article is below. Note in particular the highlighted sentence which discusses multi-loop hubs (or communities).

The three-dimensional arrangement of the human genome comprises a complex network of structural and regulatory chromatin loops important for coordinating changes in transcription during human development. To better understand the mechanisms underlying context-specific 3D chromatin structure and transcription during cellular differentiation, we generated comprehensive in situ Hi-C maps of DNA loops in human monocytes and differentiated macrophages. We demonstrate that dynamic looping events are regulatory rather than structural in nature and uncover widespread coordination of dynamic enhancer activity at preformed and acquired DNA loops. Enhancer-bound loop formation and enhancer activation of preformed loops together form multi-loop activation hubs at key macrophage genes. Activation hubs connect 3.4 enhancers per promoter and exhibit a strong enrichment for activator protein 1 (AP-1)-binding events, suggesting that multi-loop activation hubs involving cell-type-specific transcription factors represent an important class of regulatory chromatin structures for the spatiotemporal control of transcription.

A DNA loop as detected by Hi-C is a link between two regions of DNA, called anchors, detected by summing up non-duplicated fragments connecting 10 kb windows and performing some statistical test for the significance of seeing so high a count. As we can see in the summary above, a key part of the analysis was to consider how multiple DNA loops might form a hub. For example, suppose we have region 1 and region 2 detected as a loop, and region 3 and region 4 detected as a loop. If region 2 overlaps region 3 in the linear genome, then we can describe a community of loops which are in proximity to each other. The authors note that, when considering each community as the unit of study, they found 3.4 enhancers per promoter for those communities which were characterized as activation hubs. We will recapitulate these analyses below (albeit in a simple manner that the methods of the paper, simplified for pedagogical purposes).

Reading in DNA loop data

To download the data for this analysis, you can either use the following code chunk which does the downloading, unzipping and moving of files from within R; or you can download the files manually from GitHub. For the latter, go to the following GitHub repo:

https://github.com/biodatascience/network_data

…click the green button to Download ZIP to your working directory. We will use the three files in this repo in this document. You will need to unzip the directory and move the three files into your working directory.

You do not need to gunzip the three files themselves, we will read them in using read_delim.

url <- "https://github.com/biodatascience/network_data/archive/master.zip"
file <- "network_data.zip"
if (!file.exists(file)) {
  download.file(url, file)
  unzip(file)
  system("mv network_data-master/* .")
}

Now we will read in the data and create Bioconductor objects. The following code chunk reads in the full set of DNA loops (in hg19 coordinates), which are saved as two GRanges objects: anchor1 and anchor2. Here anchors refers to the two regions that form a DNA loop, as detected by Hi-C. We also assign the loop.status, which is whether the loop was gained (only seen in macrophage), static (seen in both cell types), or lost (only seen in monocytes).

library(readr)
library(GenomicRanges)
loops <- read_delim("Table_S3.Loops.Differential.txt.gz", delim="\t")
anchor1 <- GRanges(loops$anchor1_chrom,
                   IRanges(loops$anchor1_start, loops$anchor1_end))
anchor2 <- GRanges(loops$anchor2_chrom,
                   IRanges(loops$anchor2_start, loops$anchor2_end))
loop.status <- factor(loops$dynamic_loop_type)
table(loop.status)
## loop.status
## gained   lost static 
##    184     33  23313
anchor1$loop.status <- loop.status
anchor2$loop.status <- loop.status

We also read in the definitions (in hg19 coordinates) for enhancers in monocytes (O) and macrophages (A), as measured by H3K27ac ChIP-seq experiments. We put both sets of enhancers together into a single GRanges, and sort it by position.

enh.mono <- read_delim("MACSpeaks_H3K27ac_CI_THP1_O_peaks.narrowPeak.gz",
                       delim="\t", col_names=FALSE)
## 
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   X1 = col_character(),
##   X2 = col_double(),
##   X3 = col_double()
## )
enh.macro <- read_delim("MACSpeaks_H3K27ac_CI_THP1_A_peaks.narrowPeak.gz",
                        delim="\t", col_names=FALSE)
## 
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   X1 = col_character(),
##   X2 = col_double(),
##   X3 = col_double()
## )
enh <- GRanges(c(enh.mono$X1, enh.macro$X1),
               IRanges(start=c(enh.mono$X2, enh.macro$X2),
                       end=c(enh.mono$X3, enh.macro$X3)))
enh <- sort(enh)

We also need to define promoter regions for the hg19 genome, and for this we can use the UCSC known genes (this was the set of gene locations that was used in the article). Promoters are defined here as 2 kb windows upstream of the transcription start site (TSS).

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
g <- genes(txdb)
g <- keepStandardChromosomes(g, pruning.mode="coarse")
promoter <- promoters(g, upstream=2000, downstream=0)

Computing anchor, enhancer, and promoter overlaps

The following chunk of code annotates the anchors as either overlapping a promoter or enhancer, and then defines three groups of DNA loops: those connecting an enhancer to an enhancer, a promoter to a promoter, or an enhancer to a promoter. Finally, we define a group called “other” in which one or both of the anchors do not overlap an enhancer or promoter.

The first line below simply removes from the enhancer GRanges any regions overlapping a promoter. Note that, in the following categories, a loop can potentially be counted to multiple categories, because it is possible that an anchor might overlap both an enhancer and a promoter (although those two regions themselves do not overlap).

enh <- enh[!overlapsAny(enh, promoter)]
anchor1$promoter <- overlapsAny(anchor1, promoter)
anchor1$enhancer <- overlapsAny(anchor1, enh)
anchor2$promoter <- overlapsAny(anchor2, promoter)
anchor2$enhancer <- overlapsAny(anchor2, enh)
EE <- loop.status[anchor1$enhancer & anchor2$enhancer]
PP <- loop.status[anchor1$promoter & anchor2$promoter]
EP <- loop.status[(anchor1$enhancer & anchor2$promoter) |
                  (anchor2$enhancer & anchor1$promoter)]
other <- loop.status[!((anchor1$promoter | anchor1$enhancer) & 
                       (anchor2$promoter | anchor2$enhancer))]

We can then cross-tabulate the loop status for the different categories, and make a barplot to compare them. Compare with Figure 4E in the paper.

tab <- cbind(EE=table(EE), PP=table(PP), EP=table(EP), other=table(other))
prop.tab <- prop.table(tab,margin=1)
cols <- c("grey","brown")
barplot(prop.tab[c(3,1),-4], beside=TRUE, border=NA,
        col=cols, ylim=c(0,.6))
legend("topright", rownames(prop.tab)[c(3,1)], fill=cols,
       border=NA, cex=1.5, bty="n")

Stepping back

Looking at the barplot above, a first impression is that, when DNA loops are added during macrophage differentiation, most of what is going on is the addition of enhancer-enhancer interactions, and there are a few more enhancer-promoter interactions. As we hinted at in the beginning, this treats the DNA loop as the unit of study, while we might also consider multiple connected DNA loops as a community or hub. This is a good example of stepping back from an analysis and thinking critically about what is being measured and why. Another feature of this paper is Figure 4F which shows examples of 20 randomly chosen communities either containing or not containing gained loops. Just by looking at the example data, one can see that the communities tend to be larger when they contain a gained loop, and that most communities containing a gained loop contain a promoter.

With this in mind, the barplot above is interpreted in the paper as follows:

One possible explanation for the large proportion of enhancer-enhancer loops is the presence of interaction hubs involving a single promoter and multiple enhancers that all interact with each other. A fully connected hub with only one promoter and N enhancers would contain N enhancer-promoter interactions and (N)! / 2(N-2)! enhancer-enhancer interactions. For all values greater than N = 3, there would be more enhancer-enhancer loops than enhancer-promoter loops. To determine if our loops were forming such hubs we built interaction networks and detected communities of interacting anchor regions using a fast greedy modularity optimization algorithm (Clauset et al., 2004). We then classified these communities into two subsets: those containing a gained loop and those without….

We can use findOverlap to start connecting anchors to each other, to build networks. The following chunk of code constructs a non-redundant matrix of overlaps of DNA loop anchors.

fo1 <- as.matrix(findOverlaps(anchor1))
fo2 <- as.matrix(findOverlaps(anchor2))
fo12 <- as.matrix(findOverlaps(anchor1, anchor2))
overlaps <- rbind(fo1,fo2,fo12)
head(overlaps)
##      queryHits subjectHits
## [1,]         1           1
## [2,]         2           2
## [3,]         2        6163
## [4,]         3        5339
## [5,]         3       15833
## [6,]         3           3
overlaps <- overlaps[overlaps[,1] < overlaps[,2],]
overlaps <- overlaps[!duplicated(overlaps),]

Identifying connected components of a network

Once we have the matrix of which DNA loops connect to others, we can build a graph or network using the Bioconductor package RBGL. This package provides an R interface to the Boost graph library, a set of algorithms for computing on graphs (networks). The first function creates a graph from a matrix of edge definitions, and the second function outputs a list of connected components (communities) within the graph. A connected component is a set of nodes (here, DNA loops) that are connected by edges (here, loops which overlap other loops at at least one anchor in the linear genome).

library(RBGL)
graph <- ftM2graphNEL(overlaps, edgemode="undirected")
cc <- connectedComp(graph)
length(cc)
## [1] 1844

Visualizing a sub-graph using Rgraphviz:

library(Rgraphviz)
par(mfrow=c(2,2))
for (i in 1:4) {
  sg <- subGraph(cc[[i]], graph)
  plot(sg)
}

Computing statistics per community

Finally, we want to compare the communities which contain gained loops to those communities which do not (systematically make comparisons of the type in Figure 4F). To do so, we first compute a character vector of the numeric index of the loops which were categorized as gained in macrophage differentiation. We then iterate through all of the communities defined above to see which contained a gained loop. Finally, we calculate the size of the communities, and make a boxplot, clearly showing that gained loops belong to much larger communities, such that we would expect more enhancer-enhancer loops in the barplot above.

library(magrittr)
gained <- as.character(which(loop.status == "gained"))
community.with.gained <- sapply(cc, function(x) any(x %in% gained))
community.type <- factor(ifelse(community.with.gained, "gained", "static"))
community.type %<>% relevel("static")
community.size <- sapply(cc, length)
table(community.type)
## community.type
## static gained 
##   1718    126
boxplot(community.size ~ community.type, log="y", main="loops per community")

We can also tabulate the number of enhancers and promoters per community. Note that here we perform a slightly different analysis than in the paper, for simplicity of code. Here we add up the number of enhancers and promoters per loop, and then add this tally for all the loops in the community. This ends up double counting when anchors appear in multiple loops. A better analysis would focus on anchors here rather than loops (this was the analysis done in the paper).

num.enhancer <- sapply(cc, function(x) {
  i <- as.numeric(x)
  sum(anchor1$enhancer[i] + anchor2$enhancer[i])
})
num.promoter <- sapply(cc, function(x) {
  i <- as.numeric(x)
  sum(anchor1$promoter[i] + anchor2$promoter[i])
})
enh.to.promoter <- ifelse(num.promoter == 0, NA, num.enhancer / num.promoter)

We can see however that the number of enhancers-per-promoter increases for those communities that contain a gained loop:

boxplot(enh.to.promoter ~ community.type, main="E-to-P number", ylim=c(0,10))

static.E2P <- enh.to.promoter[community.type == "static"]
gained.E2P <- enh.to.promoter[community.type == "gained"]
mean(static.E2P, na.rm=TRUE)
## [1] 3.212206
mean(gained.E2P, na.rm=TRUE)
## [1] 4.35097
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.34.0                        magrittr_2.0.1                          RBGL_1.66.0                            
##  [4] graph_1.68.0                            TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.42.1                 
##  [7] AnnotationDbi_1.52.0                    Biobase_2.50.0                          GenomicRanges_1.42.0                   
## [10] GenomeInfoDb_1.26.2                     IRanges_2.24.1                          S4Vectors_0.28.1                       
## [13] BiocGenerics_0.36.0                     readr_1.4.0                             testthat_3.0.2                         
## [16] rmarkdown_2.7                           devtools_2.3.2                          usethis_2.0.1                          
## 
## loaded via a namespace (and not attached):
##  [1] matrixStats_0.58.0          bitops_1.0-6                fs_1.5.0                    bit64_4.0.5                
##  [5] progress_1.2.2              httr_1.4.2                  rprojroot_2.0.2             tools_4.0.4                
##  [9] bslib_0.2.4                 utf8_1.1.4                  R6_2.5.0                    DBI_1.1.1                  
## [13] withr_2.4.1                 tidyselect_1.1.0            prettyunits_1.1.1           processx_3.4.5             
## [17] bit_4.0.4                   curl_4.3                    compiler_4.0.4              cli_2.3.1                  
## [21] xml2_1.3.2                  DelayedArray_0.16.2         desc_1.3.0                  rtracklayer_1.50.0         
## [25] sass_0.3.1                  callr_3.5.1                 askpass_1.1                 rappdirs_0.3.3             
## [29] stringr_1.4.0               digest_0.6.27               Rsamtools_2.6.0             XVector_0.30.0             
## [33] pkgconfig_2.0.3             htmltools_0.5.1.1           sessioninfo_1.1.1           MatrixGenerics_1.2.1       
## [37] highr_0.8                   dbplyr_2.1.0                fastmap_1.1.0               rlang_0.4.10               
## [41] rstudioapi_0.13             RSQLite_2.2.3               jquerylib_0.1.3             generics_0.1.0             
## [45] jsonlite_1.7.2              BiocParallel_1.24.1         dplyr_1.0.5                 RCurl_1.98-1.2             
## [49] GenomeInfoDbData_1.2.4      Matrix_1.3-2                Rcpp_1.0.6                  fansi_0.4.2                
## [53] lifecycle_1.0.0             stringi_1.5.3               yaml_2.2.1                  SummarizedExperiment_1.20.0
## [57] zlibbioc_1.36.0             pkgbuild_1.2.0              BiocFileCache_1.14.0        blob_1.2.1                 
## [61] crayon_1.4.1                lattice_0.20-41             Biostrings_2.58.0           hms_1.0.0                  
## [65] knitr_1.31                  ps_1.6.0                    pillar_1.5.0                codetools_0.2-18           
## [69] biomaRt_2.46.3              pkgload_1.2.0               XML_3.99-0.5                glue_1.4.2                 
## [73] evaluate_0.14               remotes_2.2.0               BiocManager_1.30.10         vctrs_0.3.6                
## [77] openssl_1.4.3               purrr_0.3.4                 assertthat_0.2.1            cachem_1.0.4               
## [81] xfun_0.22                   tibble_3.1.0                GenomicAlignments_1.26.0    memoise_2.0.0              
## [85] ellipsis_0.3.1