In this last lecture note on large data manipulation in R, we change tactics a bit. Previously we discussed fast reading and subsetting with data.table, the relative advantages of SQLite and HDF5 for storing collections of tables in a single file, and how to work with those files in R using the RSQLite and rhdf5 libraries. Here we discuss an alternative approach for dealing with large arrays in which most of the elements are equal to zero. Such data are typically referred to as sparse, in that the non-zero elements are sparsely distributed, and there are special classes and methods in R that allow us to work with them in a memory- and computationally-efficient manner. We will focus in this lecture note on the classes in the Matrix package, and on functionality in the glmnet package for fitting regularized linear or generalized linear models to sparse feature matrices.
Let’s dive right into representing sparse matrices. Here we have a large-ish matrix wherein the non-zero elements make up only ~5% of the total:
m <- matrix(rbinom(1e6, 1, .05), ncol=1e3)
m[1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
sum(m)
## [1] 50174
prod(dim(m))
## [1] 1e+06
This matrix takes up about 4 Mb in memory:
print(object.size(m), units="Mb")
## 3.8 Mb
That’s actually not so big that we encounter problems on a laptop computer, but if we multiplied either or both of the dimensions by a factor of 1000, we would start to hit memory limits when working with the matrix.
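To put a rough number on that (an aside, not from the original note): a dense numeric matrix costs 8 bytes per element (the integer matrix above costs 4), so scaling one dimension up by a factor of 1000 gives about 1e9 elements:
1e6 * 1e3 * 8 / 2^30   # roughly 7.5 Gb for a dense double-precision matrix of this size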
Let’s get a sense of how much space we save if we represent this as a sparse matrix.
library(Matrix)
mm <- Matrix(m, sparse=TRUE)
mm[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
sum(mm)
## [1] 50174
print(object.size(mm), units="Mb")
## 0.6 Mb
as.numeric(object.size(m)/object.size(mm))
## [1] 6.583721
The sparse version takes up less than 1/6 of the space of the dense version.
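The savings grow as the matrix gets sparser; re-running the same comparison at 1% density rather than 5% (a small aside, not in the original note) gives a noticeably larger ratio:
m2 <- matrix(rbinom(1e6, 1, .01), ncol=1e3)
as.numeric(object.size(m2) / object.size(Matrix(m2, sparse=TRUE)))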
Of course, the coercion above, from a dense matrix to a sparse one, doesn’t make sense as a workflow: we would never want to first build the memory-intensive dense version of the matrix and then convert it down to a sparse version. Instead, we would use the sparseMatrix function to build the matrix directly, specifying only the non-zero elements and where they occur.
First, look up the help page for sparseMatrix:
?sparseMatrix
The most common way to construct a sparse matrix is to specify i, j, and x (this last argument is optional; if it is not included, the stored values are all taken to be 1, and only the pattern of non-zero locations is recorded).
s <- sparseMatrix(i=c(1,3,5),j=c(1,2,3),x=c(4,5,6),dims=c(6,4))
s
## 6 x 4 sparse Matrix of class "dgCMatrix"
##
## [1,] 4 . . .
## [2,] . . . .
## [3,] . 5 . .
## [4,] . . . .
## [5,] . . 6 .
## [6,] . . . .
This creates an object of class dgCMatrix. Take a look at the help page for this class:
?dgCMatrix-class
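The help page explains that this class uses compressed sparse column (CSC) storage. To see what that means concretely (a brief aside), we can inspect the slots of the small matrix s built above: i holds the 0-based row indices of the non-zero values, x holds the values themselves, and p holds, for each column, the offset into i and x where that column begins.
s@i
## [1] 0 2 4
s@p
## [1] 0 1 2 3 3
s@x
## [1] 4 5 6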
You can see that this class is column-oriented, which means it should be faster to index columns of these objects than rows. Likewise, if we had not specified x, the result would still be column-oriented by default, but it would be an ngCMatrix, a pattern matrix that records only the locations of the non-zero entries. Let’s do a microbenchmark to see the degree of difference; for this example, column indexing is about twice as fast.
library(microbenchmark)
n <- 1e3
nn <- 1e5
s <- sparseMatrix(i=sample(n,nn,TRUE),
j=sample(n,nn,TRUE),
dims=c(n,n))
microbenchmark(sum(s[,10]),sum(s[10,]))
## Unit: microseconds
## expr min lq mean median uq max neval
## sum(s[, 10]) 308.9 345.1 446.742 361.5 383.85 6619.4 100
## sum(s[10, ]) 484.9 555.5 664.448 570.5 614.35 4488.3 100
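If fast row access were the common case instead, one workaround (an aside, not from the original note) is to also store the transpose, so that the rows we need become columns of the stored object:
st <- t(s)   # rows of s become columns of st, so row access turns into column access
microbenchmark(sum(s[10, ]), sum(st[, 10]))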
We can perform many operations on sparse matrices using specialized functions that are different from the ones defined for regular matrices. These are described in ?dgCMatrix-class, but some of the important ones are %*%, crossprod, tcrossprod, solve, qr, and lu. These operations preserve the sparsity of the object (keeping us under our memory budget), and they will perform much faster than coercing to a dense matrix would, as long as the matrices have a high degree of sparsity.
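For instance (a small illustrative check, not from the original note), we can multiply the sparse matrix by a dense coefficient vector directly; s is never expanded to its dense form:
b <- rnorm(ncol(s))
sb <- s %*% b            # matrix-vector product, computed on the sparse representation
stb <- crossprod(s, b)   # t(s) %*% b, likewise without densifying s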
Note that some operations destroy the sparsity, such as adding 1 to every element, and therefore must be avoided when the resulting dense matrix would not fit in memory:
s[1:10,1:10] + 1
## 10 x 10 Matrix of class "dgeMatrix"
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 2 1 1 1 1 1 1 1 1
## [2,] 1 1 1 1 1 1 1 2 1 1
## [3,] 1 1 1 2 1 1 1 1 1 1
## [4,] 1 1 1 1 1 1 1 1 1 1
## [5,] 1 1 1 1 1 1 1 1 1 1
## [6,] 1 1 1 1 1 1 1 1 1 1
## [7,] 1 1 1 1 1 1 1 1 1 1
## [8,] 1 1 2 1 2 1 1 2 2 1
## [9,] 1 1 1 1 1 2 2 1 1 1
## [10,] 1 1 1 1 1 1 1 1 1 1
Other operations maintain the sparsity:
s[1:10,1:10] * 2
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## [1,] 2 2 . . . . . . . .
## [2,] . . . . . . . 2 . .
## [3,] . . . 2 . . . . . .
## [4,] . . . . . . . . . .
## [5,] . . . . . . . . . .
## [6,] . . . . . . . . . .
## [7,] . . . . . . . . . .
## [8,] . . 2 . 2 . . 2 2 .
## [9,] . . . . . 2 2 . . .
## [10,] . . . . . . . . . .
We can also plot the matrix with image, which avoids creating the dense version:
image(s[1:100,1:100])
A common use case for sparse matrices is prediction of a target, let’s call it y, using a high-dimensional, sparse matrix of features x. We are often in the situation where there are more features (columns of x) than there are observations (rows of x, equivalently the length of y). In this case it may make sense to first try linear modeling of y on x, using some combination of L1 and L2 regularization to stabilize the regression. The glmnet package allows one to fit elastic net models for such a problem, where the x matrix can be sparse; it builds off of the sparse matrix classes defined in the Matrix package. Read over the help file for the main function:
library(glmnet)
## Loaded glmnet 4.1-6
?glmnet
We can mock up some simulated data to show the kinds of models that can be fit using glmnet. Here, 50 columns of x contribute to y with a coefficient of 1 each, while the remaining columns of x are not used in constructing y.
n <- 1e3
nn <- 1e5
x <- sparseMatrix(i=sample(n,nn,TRUE),
j=sample(n,nn,TRUE),
dims=c(n,n))
beta <- rep(c(1,0),c(50,950))
y <- x %*% beta + rnorm(n,0,.25)
Running glmnet gives us our regularization path. Here we set alpha=1, which corresponds to using only the L1 penalty (the lasso). Plotting the regularization path reveals a range of lambda where the 50 true predictors have been correctly identified (non-zero estimated coefficients), while the rest of the coefficients have been shrunk to 0.
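For reference, alpha mixes the two penalties: according to the glmnet documentation, the penalty added to the loss is lambda * ((1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1), so alpha=1 gives the pure lasso (L1) penalty and alpha=0 gives ridge (L2) regression.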
Notably, for this lecture note, we never had to convert x into its dense form, thereby allowing much higher dimensions than would be possible if glmnet only took dense matrices as input.
fit <- glmnet(x, y, family="gaussian", alpha=1)
plot(fit)
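To check the claim about the path numerically (a follow-up sketch, not part of the original note; the exact picture depends on the random simulation), we can look at fit$df, which glmnet records as the number of non-zero coefficients at each value of lambda:
plot(fit$lambda, fit$df, log="x",
     xlab="lambda", ylab="number of non-zero coefficients")
abline(h=50, lty=2)   # the true number of non-zero entries in beta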