Introduction

In previous lectures, we discussed reading in large data tables and working with large databases via SQLite. Here, we discuss a middle way, using the popular HDF5 format. The motivation for using an HDF5 data container is that, as with SQLite, we have a common format for representing a complex set of tables that can be shared simply by sharing a file; unlike SQLite, however, we are typically interested in reading entire tables into memory so that we can then analyze them. HDF5 is also typically smaller on disk, and faster for writing to or reading from disk, than SQLite.

First, some information from the HDF5 Group on Why HDF5?

An HDF5 data container is a standardized, highly-customizable data receptacle designed for portability. Unless your definition of ‘container’ is extremely broad, file systems are not commonly considered containers.

File systems aren’t portable: For example, you might be able to mount an NTFS file system on an AIX machine, but the integers or floating point numbers written on an Intel processor will turn out to be garbage when read on an IBM Power processor.

HDF5 achieves portability by separating its “cargo” (data) from its environment (file system, processor architecture, etc.) and by encoding it in a self-describing file format. The HDF5 library serves the dual purpose of being a parser/encoder of this format and an API for user-level objects (datasets, groups, attributes, etc.).

The data stored in HDF5 datasets is shaped and it is typed. Datasets have (logically) the shape of multi-dimensional rectilinear arrays. All elements in a given dataset are of the same type, and HDF5 has one of the most extensive type systems and one that is user-extendable.

The rhdf5 package

As we are focusing on how to interface with various large data formats in R, we now introduce the rhdf5 package. Unlike some of the other packages we have shown, this package is maintained in the Bioconductor repository and so has a slightly different installation procedure.

install.packages("BiocManager") # can be skipped after 1st time
BiocManager::install("rhdf5")

Now we can load the package. Much of the following introduction to rhdf5 is modified from the package vignette.

library(rhdf5)

Typically, we may already have an HDF5 data container that we want to work with, but as in the SQLite lecture note, we will show how to create a new one first.

h5file <- "myDB.h5"
h5createFile(h5file)

Groups are like directories

HDF5 containers have a hierarchy built around groups which act and look a bit like directories:

h5createGroup(h5file, "A")
## [1] TRUE
h5createGroup(h5file, "B")
## [1] TRUE
h5createGroup(h5file, "A/C")
## [1] TRUE

We can list the groups:

h5ls(h5file)
##   group name     otype dclass dim
## 0     /    A H5I_GROUP           
## 1    /A    C H5I_GROUP           
## 2     /    B H5I_GROUP
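
As a side note, h5ls should also accept a recursive argument that controls how deep the listing goes, which can be handy for containers with many nested groups. A minimal sketch (output not shown):

# list only the top-level groups, without descending into them
h5ls(h5file, recursive = FALSE)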

Next, we show some examples of writing data to the HDF5 container with h5write. The row and column names of matrices, and of arrays in general, will not be stored; however, the column names of compound data types (such as a data.frame) will be stored:

x <- matrix(rnorm(1e4),nrow=100)
h5write(x, h5file, "A/x")
y <- matrix(letters, nrow=13)
h5write(y, h5file,"A/C/y")
df <- data.frame(a=1L:5L,
                 b=seq(0,1,length.out=5),
                 c=letters[1:5],
                 stringsAsFactors=FALSE)
h5write(df, h5file, "B/df")
h5ls(h5file)
##   group name       otype   dclass       dim
## 0     /    A   H5I_GROUP                   
## 1    /A    C   H5I_GROUP                   
## 2  /A/C    y H5I_DATASET   STRING    13 x 2
## 3    /A    x H5I_DATASET    FLOAT 100 x 100
## 4     /    B   H5I_GROUP                   
## 5    /B   df H5I_DATASET COMPOUND         5
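
For larger datasets, rather than writing an entire R object at once, one option is to pre-create a chunked (and optionally compressed) dataset with h5createDataset and then fill it in pieces via the index argument of h5write. The dataset name "A/big" and the chunk sizes below are just illustrative choices; this is a sketch of the pattern, not a required workflow:

# pre-allocate a 1000 x 1000 numeric dataset, stored in 100 x 100 chunks
# with compression level 6
h5createDataset(h5file, "A/big", dims = c(1000, 1000),
                storage.mode = "double", chunk = c(100, 100), level = 6)
# write a block of values into the first 100 rows and columns only
z <- matrix(rnorm(1e4), nrow = 100)
h5write(z, h5file, "A/big", index = list(1:100, 1:100))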

Reading objects

We can read out these objects using h5read. Note that the column names of the data.frame have been preserved:

xx <- h5read(h5file, "A/x")
xx[1:3,1:3]
##            [,1]       [,2]       [,3]
## [1,] -2.9180159 -0.3099286  0.5671834
## [2,]  1.2320955 -1.5603322 -0.7619277
## [3,] -0.3517632  0.2978257  0.9193802
yy <- h5read(h5file, "A/C/y")
head(yy)
##      [,1] [,2]
## [1,] "a"  "n" 
## [2,] "b"  "o" 
## [3,] "c"  "p" 
## [4,] "d"  "q" 
## [5,] "e"  "r" 
## [6,] "f"  "s"
df2 <- h5read(h5file, "B/df")
head(df2)
##   a    b c
## 1 1 0.00 a
## 2 2 0.25 b
## 3 3 0.50 c
## 4 4 0.75 d
## 5 5 1.00 e
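
A convenient feature of h5read is that we do not have to read a whole dataset into memory: its index argument lets us pull just a subset of rows and columns from disk. A small sketch (the particular subset is arbitrary):

# read only the first 3 rows and first 3 columns of A/x from disk
xsub <- h5read(h5file, "A/x", index = list(1:3, 1:3))
dim(xsub)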

Integration with Rcpp

During package development, you may find it easier to read from or write to an HDF5 file directly from your C++ code. RcppArmadillo provides this functionality, as detailed in its documentation. If you search for hdf5 at this link, you will find a few options for loading and saving objects in this format.

One caveat listed in their documentation is the following:

Caveat: for saving/loading HDF5 files, support for HDF5 must be enabled within Armadillo’s configuration; the hdf5.h header file must be available on your system and you will need to link with the HDF5 library (eg. -lhdf5)

This can be achieved by adding a Makevars or Makevars.win file to your package’s src/ directory. General information on Makevars files can be found here. A specific walkthrough of how to do this for HDF5 is given here. An example of using the HDF5 library in practice can be found here; that example uses the “H5Cpp.h” header instead of “hdf5.h”, both of which are referenced in the Rhdf5lib link earlier.
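
If you use the HDF5 library bundled with Bioconductor’s Rhdf5lib package, that package provides (if we recall its interface correctly) a pkgconfig helper that prints the linker flags you would put in PKG_LIBS in your Makevars. A rough sketch of checking it interactively, assuming Rhdf5lib is installed:

# print the flags needed to link C++ code against the HDF5 library
# shipped with Rhdf5lib (assumes Rhdf5lib exports pkgconfig)
Rhdf5lib::pkgconfig("PKG_CXX_LIBS")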

DelayedArray

The DelayedArray Bioconductor package offers an R-friendly way to work with datasets too large to load into memory, and can also leverage some of the advantages of the HDF5 format via the HDF5Array package. Additional packages such as DelayedMatrixStats can be used to perform operations on DelayedMatrix objects from the DelayedArray package.
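
As a brief sketch of how this connects to the container created above (assuming the HDF5Array package is installed), we can point an HDF5Array at the matrix we wrote earlier. Operations on it are delayed rather than executed immediately, and data are only pulled from disk when a result is actually realized:

library(HDF5Array)
# wrap the on-disk dataset "A/x" without loading it into memory
A <- HDF5Array(h5file, "A/x")
# A is a DelayedMatrix; the transformation below is recorded, not computed
logA <- log(A + 10)
# realize a small corner in memory as an ordinary matrix
as.matrix(logA[1:3, 1:3])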