Introduction

In this first lecture note on dealing with large datasets in R, we will introduce a specialized package for manipulating large tables of data, called data.table. For some background material, see the following links. The first link has useful speed comparisons with other packages as well as a table of basic operations, some of which we will cover in this lecture note, but others which go beyond what we cover here.

To motivate why we start with data.table, I refer to the following sections of the book, Efficient R Programming, by Colin Gillespie and Robin Lovelace which was mentioned earlier in the course:

We will also discuss data input and data processing below.

Reading large data from plain text

One of the initial hurdles for working with large datasets is simply reading in the data. data.table has a fast implementation fread for fast read. Start by reading the help for fread:

?fread

Note that it has a number of arguments, some of which are not the same as base R’s read.table, e.g. stringsAsFactors=FALSE and sep="auto". A natural comparison is fread vs the read_csv and read_tsv functions in the readr package. In the first Efficient R Programming link above, these are compared and they state that for large files, e.g. > 100Mb, the fread and read_csv functions are about equal, and 5x faster than base R’s read.csv.

Let’s compare fread and read.csv on a large file we will work with throughout the course, the College Scorecard dataset.

Reading in College Scorecard dataset

Briefly, the College Scorecard dataset is compiled by the Department of Education and has the following descriptive paragraph from their website:

The College Scorecard project is designed to increase transparency, putting the power in the hands of students and families to compare how well individual postsecondary institutions are preparing their students to be successful. This project provides data to help students and families compare college costs and outcomes as they weigh the tradeoffs of different colleges, accounting for their own needs and educational goals. These data are provided through federal reporting from institutions, data on federal financial aid, and tax information. These data provide insights into the performance of institutions that receive federal financial aid dollars, and the outcomes of the students of those institutions

We will discuss this dataset in more detail later, when we begin to model and find associations in the data, but for now just consider it as a large dataset (4.5 Gb uncompressed) in a series of comma-separated value (CSV) files.

We have downloaded the dataset from the website, and combined a number of the CSV files into a single file Scorecard_2009-2016.csv. This file is constructed via the following shell commands:

for i in `ls MERGED201*`; do echo $i; tail -n +2 $i > ${i/.csv}_nohead.csv; done
cp MERGED2009_10_PP.csv MERGED2009_head.csv
cat MERGED*head.csv > Scorecard_2009-2016.csv

The merged file has 60,307 rows (including a column header) and 1899 columns. To demonstrate the speed of fread, we will try just reading a subset of the full dataset. We can see that just on the first 10,000 rows, fread is more than 3x faster than read.csv:

library(data.table)
n <- 10000
file <- "CollegeScorecard_Raw_Data/Scorecard_2009-2016.csv"
system.time({
  scores <- fread(file, nrows=n)
})

##    user  system elapsed 
##   4.616   0.120   4.737

system.time({
  scores2 <- read.csv(file, nrows=n)
})

##    user  system elapsed 
##  15.849   0.096  15.946

The output is a bit different as well:

class(scores)

## [1] "data.table" "data.frame"

class(scores2)

## [1] "data.frame"

Data manipulation with data.table

Some syntax is shared with data.frame but there are also additional operations that are specially designed for speed and for reduced keystrokes. Here, NUMBRANCH is a column of the data.table and so we can pull out certain rows by invoking the column name without having to write scores$NUMBRANCH. This gives the scores which have more than 25 branches:

z <- scores[NUMBRANCH > 25]
nrow(z)

## [1] 253

A preview to later, we could have also gotten this number with some special data.table code, where .N gives us the number of rows:

scores[NUMBRANCH > 25, .N]

## [1] 253

We can also pull out rows by matching on a string:

scores[INSTNM == "University of North Carolina at Chapel Hill",1:20]

##    UNITID    OPEID OPEID6                                      INSTNM
## 1: 199120 00297400   2974 University of North Carolina at Chapel Hill
##           CITY STABBR   ZIP ACCREDAGENCY INSTURL NPCURL SCH_DEG HCM2 MAIN
## 1: Chapel Hill     NC 27599         NULL    NULL   NULL       3 NULL    1
##    NUMBRANCH PREDDEG HIGHDEG CONTROL ST_FIPS REGION LOCALE
## 1:         1       3       4       1      37      5   NULL

We can also specify a column to be a key for the data.table. Specifying a key allows very fast subsetting based on the column you specify. Here, because the key is an integer, we wrap up the key in .(), otherwise it would interpret our request as a row number:

setkey(scores, UNITID)
scores[.(199120),1:20]

##    UNITID    OPEID OPEID6                                      INSTNM
## 1: 199120 00297400   2974 University of North Carolina at Chapel Hill
##           CITY STABBR   ZIP ACCREDAGENCY INSTURL NPCURL SCH_DEG HCM2 MAIN
## 1: Chapel Hill     NC 27599         NULL    NULL   NULL       3 NULL    1
##    NUMBRANCH PREDDEG HIGHDEG CONTROL ST_FIPS REGION LOCALE
## 1:         1       3       4       1      37      5   NULL

It happens that the Id column of this dataset is an integer, but we could also have made a string into the key:

setkey(scores, CITY)
scores["Chapel Hill",1:20]

##    UNITID    OPEID OPEID6                                      INSTNM
## 1: 199120 00297400   2974 University of North Carolina at Chapel Hill
## 2: 455141 04140700  41407                 Aveda Institute-Chapel Hill
##           CITY STABBR        ZIP ACCREDAGENCY INSTURL NPCURL SCH_DEG HCM2
## 1: Chapel Hill     NC      27599         NULL    NULL   NULL       3 NULL
## 2: Chapel Hill     NC 27514-7001         NULL    NULL   NULL       1 NULL
##    MAIN NUMBRANCH PREDDEG HIGHDEG CONTROL ST_FIPS REGION LOCALE
## 1:    1         1       3       4       1      37      5   NULL
## 2:    1         1       1       1       3      37      5   NULL

As you can see the key does not have to be unique (unlike row names in R which must be unique). Subsetting with a key column using data.table is much faster than subsetting via other methods. The Data processing with data.table chapter of the Efficient R Programming book shows that subsetting a data.table by key is more than 60x faster than base R data.frame and more than 40x faster than using dplyr. The data.table website also has updated speed comparisons of data.table to pandas and dplyr, and included Spark and pydatatable, so you can get a sense of how different operations may differ across these packages. But the main takeaway should be that data.table is fast and if you have large datasets, you shouldn’t be using data.frame and base R functions for subsetting or grouping and summarization.

Functions inside the brackets

We can put functions inside of the square brackets.

We first convert TUITFTE to numeric, which gives a warning about NAs introduced in the coercion step:

scores$TUITFTE <- as.numeric(scores$TUITFTE)

## Warning: NAs introduced by coercion

Here a trivial example, to calculate the mean of the tuition per FTE (full-time equivalent) student.

scores[,mean(TUITFTE,na.rm=TRUE)]

## [1] 8392.371

To make this a little easier to read, let’s define our own functions:

mean2 <- function(x) mean(x, na.rm=TRUE)
q25 <- function(x) quantile(x, .25, na.rm=TRUE)
q50 <- function(x) quantile(x, .50, na.rm=TRUE)
q75 <- function(x) quantile(x, .75, na.rm=TRUE)

Now again. This example is trivial as we could just as well computed the function after having extracted the column:

scores[,mean2(TUITFTE)]

## [1] 8392.371

mean2(scores$TUITFTE)

## [1] 8392.371

Grouping operations

The power of putting the function inside the square brackets is that it can be combined easily with subsetting and grouping operations. For example:

scores[CONTROL==1,mean2(TUITFTE)]

## [1] 3482.954

Or with a grouping operation:

scores[,mean2(TUITFTE),by=CONTROL]

##    CONTROL        V1
## 1:       3  8899.570
## 2:       1  3482.954
## 3:       2 12981.861

We can also compute multiple functions of various columns, e.g. mean and standard deviation at the same time. We use the .() operator which is synonymous with list().

scores[,.(median=q50(TUITFTE),q25=q25(TUITFTE),q75=q75(TUITFTE)),by=CONTROL]

##    CONTROL  median     q25      q75
## 1:       3  7621.5 4823.25 11503.25
## 2:       1  2549.0 1381.00  4635.00
## 3:       2 11631.5 7775.00 16654.25

library(ggplot2)
dat <- scores[,.(median=q50(TUITFTE),q25=q25(TUITFTE),q75=q75(TUITFTE)),by=CONTROL]
ggplot(dat, aes(CONTROL, median, ymin=q25, ymax=q75)) + geom_pointrange() +
  xlab("category") + ylab("TUITION / FTE")

Again, there are other complex functionality that can be performed with data.table, which can be looked over at the data.table website, but the operations above cover some of the most common use cases for reading, subsetting and computing on a large tabular dataset.

Working with data.table

Michael Love