In this last lecture note on large data manipulation in R, we change tactics a bit. Previously we discussed fast reading and subsetting with data.table, the relative advantages of SQLite and HDF5 for storing collections of tables in a single file, and how to work with these in R using the RSQLite and rhdf5 packages. Here we discuss an alternative approach for dealing with large arrays in which most of the elements are equal to zero. There are special classes and methods in R that allow us to work with such data in a memory- and computationally efficient manner. These data are typically referred to as sparse data, in the sense that the non-zero elements of the array are sparse (rare). We will focus in this lecture note on the classes in the Matrix package, and on functionality in the glmnet package for fitting regularized linear or generalized linear models to sparse feature matrices.

Representing sparse matrices

Let’s dive right into representing sparse matrices. Here we have a large-ish matrix wherein the non-zero elements make up only ~5% of the total:

m <- matrix(rbinom(1e6, 1, .05), ncol=1e3)
m[1:5,1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0
sum(m)
## [1] 50174
prod(dim(m))
## [1] 1e+06

This matrix takes up about 4 Mb in memory:

print(object.size(m), units="Mb")
## 3.8 Mb

That’s actually not so big that we encounter problems on a laptop computer, but if we multiply either or both of the dimensions by a factor of 1000, we will start to hit limits in terms of fitting and working with the matrix in memory.
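As a rough back-of-envelope check (a sketch, assuming 4 bytes per integer and 8 bytes per double), scaling just the number of rows by 1000 already puts the dense matrix in the multi-gigabyte range:

1e6 * 1e3 * 4 / 2^30  # a 1e6 x 1e3 integer matrix: ~3.7 Gb
1e6 * 1e3 * 8 / 2^30  # the same matrix stored as doubles: ~7.5 Gb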

Let’s get a sense of how much space we save if we represent this as a sparse matrix.

library(Matrix)
mm <- Matrix(m, sparse=TRUE)
mm[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##               
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
sum(mm)
## [1] 50174
print(object.size(mm), units="Mb")
## 0.6 Mb
as.numeric(object.size(m)/object.size(mm))
## [1] 6.583721

The sparse version takes up less than 1/6 of the space of the dense version.
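The savings scale with the number of non-zero entries. As a quick sketch (not shown above), repeating the comparison at 1% density rather than 5% gives a much larger ratio:

m1 <- matrix(rbinom(1e6, 1, .01), ncol=1e3)
mm1 <- Matrix(m1, sparse=TRUE)
as.numeric(object.size(m1)/object.size(mm1))  # roughly 30x here, vs ~6.6x at 5% density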

How to construct sparse matrices

The coercion above, of a dense matrix into a sparse one, doesn’t make sense in practice: we would never want to first build the memory-intensive dense version of the matrix and then convert it down to the sparse version. Instead, we use the sparseMatrix function to build the matrix by specifying only the non-zero elements and where they occur.

First look up the help page for sparseMatrix:

?sparseMatrix

The most common way to construct a sparse matrix is to specify i, j, and x (this last argument is optional; if it is not included, the stored entries are simply TRUE, i.e. they act as 1s, and we get a pattern matrix).

s <- sparseMatrix(i=c(1,3,5),j=c(1,2,3),x=c(4,5,6),dims=c(6,4))
s
## 6 x 4 sparse Matrix of class "dgCMatrix"
##             
## [1,] 4 . . .
## [2,] . . . .
## [3,] . 5 . .
## [4,] . . . .
## [5,] . . 6 .
## [6,] . . . .

This creates an object of class dgCMatrix. Take a look at the help page for this class:

?"dgCMatrix-class"

You can see that this class is column-oriented (compressed sparse column, or CSC, format), which means it should be faster to index columns of these objects than rows. If we had not specified x, the result would still be column-oriented by default, but it would instead be of class ngCMatrix, a pattern matrix storing only the locations of the non-zero entries. Let’s do a microbenchmark to see the degree of difference. For this example, column indexing is roughly 1.5 times as fast as row indexing.

library(microbenchmark)
n <- 1e3
nn <- 1e5
s <- sparseMatrix(i=sample(n,nn,TRUE),
                  j=sample(n,nn,TRUE),
                  dims=c(n,n))
microbenchmark(sum(s[,10]),sum(s[10,]))
## Unit: microseconds
##          expr   min    lq    mean median     uq    max neval
##  sum(s[, 10]) 308.9 345.1 446.742  361.5 383.85 6619.4   100
##  sum(s[10, ]) 484.9 555.5 664.448  570.5 614.35 4488.3   100
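The timing difference reflects the compressed sparse column layout: the object stores the row indices of its non-zero entries, plus a pointer for each column saying where that column’s entries begin. We can peek at this directly (a quick sketch using the @ slot accessor; this s is a pattern matrix, so it has i and p slots but no x slot):

str(s)     # compact view of the object and its slots
head(s@i)  # 0-based row indices of the stored entries
head(s@p)  # column "pointers": positions in i where each column starts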

Manipulating sparse matrices

We can perform many operations on sparse matrices using specialized methods, which are distinct from the ones defined for regular matrices. These are described in ?"dgCMatrix-class"; some of the important ones are %*%, crossprod, tcrossprod, solve, qr, and lu. Using these operations preserves the sparsity of the object (so keeping us under our memory budget), and they perform much faster than first coercing to a dense matrix would, provided the matrices have a high degree of sparsity.
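For example, a matrix-vector product and a cross-product can both be computed without ever forming a dense copy of the matrix (a small sketch; we first make a numeric copy of s, since s above was built as a pattern matrix):

sn <- s * 1                 # numeric (dgCMatrix) copy of the pattern matrix s
v <- rnorm(ncol(sn))
head(as.numeric(sn %*% v))  # matrix-vector product, computed on the sparse representation
class(crossprod(sn))        # t(sn) %*% sn also remains a sparse Matrix class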

Note that some operations destroy the sparsity, such as adding 1, and therefore must be avoided (in the case where the dense matrix would not fit in memory):

s[1:10,1:10] + 1
## 10 x 10 Matrix of class "dgeMatrix"
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    2    2    1    1    1    1    1    1    1     1
##  [2,]    1    1    1    1    1    1    1    2    1     1
##  [3,]    1    1    1    2    1    1    1    1    1     1
##  [4,]    1    1    1    1    1    1    1    1    1     1
##  [5,]    1    1    1    1    1    1    1    1    1     1
##  [6,]    1    1    1    1    1    1    1    1    1     1
##  [7,]    1    1    1    1    1    1    1    1    1     1
##  [8,]    1    1    2    1    2    1    1    2    2     1
##  [9,]    1    1    1    1    1    2    2    1    1     1
## [10,]    1    1    1    1    1    1    1    1    1     1

Other operations maintain the sparsity:

s[1:10,1:10] * 2
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                          
##  [1,] 2 2 . . . . . . . .
##  [2,] . . . . . . . 2 . .
##  [3,] . . . 2 . . . . . .
##  [4,] . . . . . . . . . .
##  [5,] . . . . . . . . . .
##  [6,] . . . . . . . . . .
##  [7,] . . . . . . . . . .
##  [8,] . . 2 . 2 . . 2 2 .
##  [9,] . . . . . 2 2 . . .
## [10,] . . . . . . . . . .

We can also plot an image of a corner of the matrix; the image method for sparse matrices avoids creating the dense version:

image(s[1:100,1:100])
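A couple of other small utilities from the Matrix package are handy for exploring sparse objects (a quick sketch; for a pattern matrix like s, summary() shows only the row and column indices of the stored entries):

nnzero(s)                 # number of stored non-zero entries
nnzero(s) / prod(dim(s))  # overall density of the matrix
head(summary(s))          # triplet-style view of the stored entries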

Use of sparse matrices in glmnet

A common use case for sparse matrices is predicting a target, let’s call it y, using a high-dimensional, sparse matrix of features x. We are often in the situation where there are more features in x than there are observations (rows of x, and the length of y). In this case it may make sense to first try linear modeling of y on x, using some combination of L1 and L2 regularization to stabilize the regression. The glmnet package allows one to fit such elastic net models, where the x matrix can be sparse; it builds on the sparse matrix classes defined in the Matrix package. Read over the help file for the main function:

library(glmnet)
## Loaded glmnet 4.1-6
?glmnet

We can mock up some simulated data to show the kinds of models that can be fit using glmnet. Here, 50 columns of x contribute to y with a coefficient of 1 each, and the remaining 950 columns of x are not used in constructing y.

n <- 1e3
nn <- 1e5
x <- sparseMatrix(i=sample(n,nn,TRUE),
                  j=sample(n,nn,TRUE),
                  dims=c(n,n))
beta <- rep(c(1,0),c(50,950))
y <- x %*% beta + rnorm(n,0,.25)

Running glmnet gives us the regularization path. Here we set alpha=1, which corresponds to using only the L1 penalty (the lasso). Plotting the regularization path reveals a range of lambda where the 50 true predictors have been correctly identified (non-zero coefficients), while the rest of the coefficients have been shrunk to 0.

Notably, for this lecture note, we never had to convert x into its dense form, thereby allowing much higher dimensions than would be possible if glmnet only took dense matrices as input.

fit <- glmnet(x, y, family="gaussian", alpha=1)
plot(fit)
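To inspect which features were selected, one can extract the coefficients at a particular value of lambda along the path (a sketch; the index 30 here is arbitrary, chosen just for illustration):

b_hat <- coef(fit, s=fit$lambda[30])  # coefficients at one lambda on the path
sum(as.numeric(b_hat)[-1] != 0)       # number of non-zero coefficients, excluding the intercept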