In previous lectures, we discussed reading in large data tables and working with large databases via SQLite. Here, we discuss a middle way: the popular HDF5 format. The motivation for using an HDF5 data container is that, like SQLite, it gives us a common format for representing a complex set of tables that can be shared simply by sharing a file; unlike SQLite, however, we are typically interested in reading entire tables into memory so that we can then analyze them. HDF5 is also typically smaller on disk, and faster to read from and write to disk, than SQLite.
First, some information from the HDF5 Group on Why HDF5?
An HDF5 data container is a standardized, highly-customizable data receptacle designed for portability. Unless your definition of ‘container’ is extremely broad, file systems are not commonly considered containers.
File systems aren’t portable: For example, you might be able to mount an NTFS file system on an AIX machine, but the integers or floating point numbers written on an Intel processor will turn out to be garbage when read on a IBM Power processor.
HDF5 achieves portability by separating its “cargo” (data) from its environment (file system, processor architecture, etc.) and by encoding it in a self-describing file format. The HDF5 library serves the dual purpose of being a parser/encoder of this format and an API for user-level objects (datasets, groups, attributes, etc.).
…
The data stored in HDF5 datasets is shaped and it is typed. Datasets have (logically) the shape of multi-dimensional rectilinear arrays. All elements in a given dataset are of the same type, and HDF5 has one of the most extensive type systems and one that is user-extendable.
As we are focusing on how to interface with various large data formats in R, we now introduce the rhdf5 package. Unlike some of the other packages we have shown, this package is maintained on the Bioconductor repository and so has a special installation.
install.packages("BiocManager") # can be skipped after 1st time
BiocManager::install("rhdf5")
Now we can load the package. Much of the following introduction to rhdf5 is modified from the package vignette.
library(rhdf5)
Typically, we may already have an HDF5 data container that we want to work with, but as in the SQLite lecture note, we will show how to create a new one first.
h5file <- "myDB.h5"
h5createFile(h5file)
HDF5 containers have a hierarchy built around groups, which act and look a bit like directories:
h5createGroup(h5file, "A")
## [1] TRUE
h5createGroup(h5file, "B")
## [1] TRUE
h5createGroup(h5file, "A/C")
## [1] TRUE
We can list the groups:
h5ls(h5file)
## group name otype dclass dim
## 0 / A H5I_GROUP
## 1 /A C H5I_GROUP
## 2 / B H5I_GROUP
Finally, we show some examples of writing data to the HDF5 container with h5write. Row and column names of matrices (and arrays in general) will not be stored; however, the column names of compound data types (such as a data.frame) will be stored:
x <- matrix(rnorm(1e4),nrow=100)
h5write(x, h5file, "A/x")
y <- matrix(letters, nrow=13)
h5write(y, h5file,"A/C/y")
df <- data.frame(a = 1L:5L,
                 b = seq(0, 1, length.out = 5),
                 c = letters[1:5],
                 stringsAsFactors = FALSE)
h5write(df, h5file, "B/df")
h5ls(h5file)
## group name otype dclass dim
## 0 / A H5I_GROUP
## 1 /A C H5I_GROUP
## 2 /A/C y H5I_DATASET STRING 13 x 2
## 3 /A x H5I_DATASET FLOAT 100 x 100
## 4 / B H5I_GROUP
## 5 /B df H5I_DATASET COMPOUND 5
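For larger datasets, it can help to create the dataset up front and control its on-disk chunking and compression, rather than writing an in-memory object in one shot. A minimal sketch using rhdf5's h5createDataset (the file name, dataset name, and dimensions here are made up for the example):

```r
library(rhdf5)

h5file2 <- "myDB2.h5"   # hypothetical file name for this sketch
h5createFile(h5file2)

# Pre-allocate an empty, chunked dataset; chunking determines the on-disk
# layout and enables compression (level 0-9, higher = smaller but slower).
h5createDataset(h5file2, "bigmat", dims = c(1000, 50),
                storage.mode = "double", chunk = c(100, 50), level = 6)

# Write one block of rows into the pre-allocated dataset via `index`:
h5write(matrix(rnorm(100 * 50), nrow = 100), h5file2, "bigmat",
        index = list(1:100, 1:50))
h5closeAll()
```

Writing block by block like this keeps memory use bounded by the block size rather than the full dataset size.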
We can read these objects back out using h5read. Note that the column names of the data.frame have been preserved:
xx <- h5read(h5file, "A/x")
xx[1:3,1:3]
## [,1] [,2] [,3]
## [1,] -2.9180159 -0.3099286 0.5671834
## [2,] 1.2320955 -1.5603322 -0.7619277
## [3,] -0.3517632 0.2978257 0.9193802
yy <- h5read(h5file, "A/C/y")
head(yy)
## [,1] [,2]
## [1,] "a" "n"
## [2,] "b" "o"
## [3,] "c" "p"
## [4,] "d" "q"
## [5,] "e" "r"
## [6,] "f" "s"
df2 <- h5read(h5file, "B/df")
head(df2)
## a b c
## 1 1 0.00 a
## 2 2 0.25 b
## 3 3 0.50 c
## 4 4 0.75 d
## 5 5 1.00 e
During package development, you may find it easier to read from or write to an HDF5 file directly from your C++ code. RcppArmadillo allows for this functionality, as detailed in its documentation. If you search for hdf5 at this link, you will find a few options for loading and saving objects in this format.
One caveat listed in their documentation is the following:
Caveat: for saving/loading HDF5 files, support for HDF5 must be enabled within Armadillo’s configuration; the hdf5.h header file must be available on your system and you will need to link with the HDF5 library (eg. -lhdf5)
This can be achieved by adding a Makevars or Makevars.win file to your package’s src/ directory indicating this. General information on Makevars files can be found here. A specific walkthrough of how to do this for HDF5 is given here. An example of using the HDF5 library in practice can be found here. This example uses the “H5Cpp.h” header instead of “hdf5.h”; both are referenced in the Rhdf5lib link earlier.
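For example, a minimal src/Makevars along these lines could work; the include path below is a hypothetical placeholder that will vary by system (the Rhdf5lib package can instead supply the compiled library, avoiding the system dependency):

```make
# Hypothetical src/Makevars: compile and link against a system HDF5 install.
# The -lhdf5 flag matches the caveat above; the include path will vary.
PKG_CPPFLAGS = -I/usr/include/hdf5/serial
PKG_LIBS = -lhdf5
```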
The DelayedArray Bioconductor package offers an R-friendly way to work with datasets too large to load into memory, and can also leverage some of the advantages of the HDF5 format via the HDF5Array package. Additional packages such as DelayedMatrixStats can be used to perform operations on DelayedMatrix objects from the DelayedArray package.
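As a sketch of the idea (the file and dataset names below are made up for the example), HDF5Array can wrap an on-disk dataset as a DelayedMatrix, so that operations are evaluated block by block only when results are actually needed:

```r
library(rhdf5)
library(HDF5Array)   # loads DelayedArray as a dependency

# Self-contained setup, mirroring the earlier write examples:
h5file <- "delayed_demo.h5"   # hypothetical file name
h5createFile(h5file)
h5write(matrix(rnorm(1e4), nrow = 100), h5file, "x")
h5closeAll()

# A DelayedMatrix is a lazy view onto the on-disk data; no values are
# read into memory until a computation requires them:
M <- HDF5Array(h5file, "x")
dim(M)
## [1] 100 100
head(colMeans(M), 3)   # computed block-wise from disk
```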