In this first lecture note on dealing with large datasets in R, we introduce a specialized package for manipulating large tables of data, called data.table. For background material, see the following links. The first link has useful speed comparisons with other packages as well as a table of basic operations, some of which we will cover in this lecture note, while others go beyond what we cover here.
To motivate why we start with data.table, I refer to the following sections of the book Efficient R Programming by Colin Gillespie and Robin Lovelace, which was mentioned earlier in the course:
We will also discuss data input and data processing below.
One of the initial hurdles for working with large datasets is simply reading in the data. data.table has a fast implementation, fread, short for "fast read". Start by reading the help for fread:
?fread
Note that it has a number of arguments, some of whose defaults differ from base R's read.table, e.g. stringsAsFactors=FALSE and sep="auto". A natural comparison is fread vs. the read_csv and read_tsv functions in the readr package. In the first Efficient R Programming link above, these are compared, and the authors state that for large files, e.g. > 100 Mb, fread and read_csv are about equal in speed, and about 5x faster than base R's read.csv.
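Beyond raw speed, fread can save both time and memory on wide files by reading only the columns you need, via its select argument (or the complementary drop argument). A minimal sketch, using three columns we know are in the Scorecard file described below:

library(data.table)
# read only the named columns; 'select' also accepts column positions
sub <- fread("CollegeScorecard_Raw_Data/Scorecard_2009-2016.csv",
             select=c("UNITID","INSTNM","TUITFTE"))
# equivalently, skip unwanted columns: fread(file, drop=c("CITY","ZIP"))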
Let’s compare fread and read.csv on a large file we will work with throughout the course, the College Scorecard dataset.
Briefly, the College Scorecard dataset is compiled by the U.S. Department of Education, which provides the following description on its website:
The College Scorecard project is designed to increase transparency, putting the power in the hands of students and families to compare how well individual postsecondary institutions are preparing their students to be successful. This project provides data to help students and families compare college costs and outcomes as they weigh the tradeoffs of different colleges, accounting for their own needs and educational goals. These data are provided through federal reporting from institutions, data on federal financial aid, and tax information. These data provide insights into the performance of institutions that receive federal financial aid dollars, and the outcomes of the students of those institutions.
We will discuss this dataset in more detail later, when we begin to model and find associations in the data, but for now just consider it as a large dataset (4.5 GB uncompressed) spread across a series of comma-separated value (CSV) files.
We have downloaded the dataset from the website and combined a number of the CSV files into a single file, Scorecard_2009-2016.csv. This file was constructed via the following shell commands:
# strip the header line from each of the 2010-2016 files (MERGED201* does not match the 2009 file)
for i in `ls MERGED201*`; do echo $i; tail -n +2 $i > ${i/.csv}_nohead.csv; done
# keep the 2009 file with its header, so the merged file has exactly one header row
cp MERGED2009_10_PP.csv MERGED2009_head.csv
# concatenate the header file and the headerless files (the glob matches both *_head and *_nohead)
cat MERGED*head.csv > Scorecard_2009-2016.csv
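As an aside, the same combination could be sketched in R with data.table itself: fread each yearly file and stack them with rbindlist. This is only a sketch, not the construction used above; it assumes the MERGED CSVs sit in the working directory, and the filename pattern is hypothetical:

library(data.table)
files <- list.files(pattern="^MERGED20(09|1[0-5]).*\\.csv$")  # hypothetical pattern for the yearly files
dts <- lapply(files, fread)          # read each year's table
merged <- rbindlist(dts, fill=TRUE)  # stack rows; fill=TRUE pads any missing columns with NA
fwrite(merged, "Scorecard_2009-2016.csv")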
The merged file has 60,307 rows (including a column header) and 1,899 columns. To demonstrate the speed of fread, we will try reading just a subset of the full dataset. Even on just the first 10,000 rows, fread is more than 3x faster than read.csv:
library(data.table)
n <- 10000
file <- "CollegeScorecard_Raw_Data/Scorecard_2009-2016.csv"
system.time({
scores <- fread(file, nrows=n)
})
## user system elapsed
## 4.616 0.120 4.737
system.time({
scores2 <- read.csv(file, nrows=n)
})
## user system elapsed
## 15.849 0.096 15.946
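Part of fread's speed comes from parallelism: it reads with multiple threads by default. You can inspect or change the thread count, e.g. to be polite on a shared machine:

getDTthreads()   # number of threads data.table will use
setDTthreads(4)  # e.g., cap data.table at 4 threads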
The output is a bit different as well:
class(scores)
## [1] "data.table" "data.frame"
class(scores2)
## [1] "data.frame"
Some syntax is shared with data.frame, but there are also additional operations specially designed for speed and for reduced keystrokes. Here, NUMBRANCH is a column of the data.table, so we can pull out certain rows by invoking the column name directly, without having to write scores$NUMBRANCH. This gives the rows with more than 25 branches:
z <- scores[NUMBRANCH > 25]
nrow(z)
## [1] 253
As a preview of what is to come, we could also have gotten this number with some special data.table code, where .N gives us the number of rows:
scores[NUMBRANCH > 25, .N]
## [1] 253
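.N also combines naturally with grouping, an operation we return to below; for example, counting institutions per state:

scores[, .N, by=STABBR]  # number of rows for each value of STABBR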
We can also pull out rows by matching on a string:
scores[INSTNM == "University of North Carolina at Chapel Hill",1:20]
## UNITID OPEID OPEID6 INSTNM
## 1: 199120 00297400 2974 University of North Carolina at Chapel Hill
## CITY STABBR ZIP ACCREDAGENCY INSTURL NPCURL SCH_DEG HCM2 MAIN
## 1: Chapel Hill NC 27599 NULL NULL NULL 3 NULL 1
## NUMBRANCH PREDDEG HIGHDEG CONTROL ST_FIPS REGION LOCALE
## 1: 1 3 4 1 37 5 NULL
We can also specify a column to be a key for the data.table. Specifying a key allows very fast subsetting based on the column you specify. Here, because the key column is an integer, we wrap the value in .(); otherwise our request would be interpreted as a row number:
setkey(scores, UNITID)
scores[.(199120),1:20]
## UNITID OPEID OPEID6 INSTNM
## 1: 199120 00297400 2974 University of North Carolina at Chapel Hill
## CITY STABBR ZIP ACCREDAGENCY INSTURL NPCURL SCH_DEG HCM2 MAIN
## 1: Chapel Hill NC 27599 NULL NULL NULL 3 NULL 1
## NUMBRANCH PREDDEG HIGHDEG CONTROL ST_FIPS REGION LOCALE
## 1: 1 3 4 1 37 5 NULL
It happens that the ID column of this dataset, UNITID, is an integer, but we could also have made a string column the key:
setkey(scores, CITY)
scores["Chapel Hill",1:20]
## UNITID OPEID OPEID6 INSTNM
## 1: 199120 00297400 2974 University of North Carolina at Chapel Hill
## 2: 455141 04140700 41407 Aveda Institute-Chapel Hill
## CITY STABBR ZIP ACCREDAGENCY INSTURL NPCURL SCH_DEG HCM2
## 1: Chapel Hill NC 27599 NULL NULL NULL 3 NULL
## 2: Chapel Hill NC 27514-7001 NULL NULL NULL 1 NULL
## MAIN NUMBRANCH PREDDEG HIGHDEG CONTROL ST_FIPS REGION LOCALE
## 1: 1 1 3 4 1 37 5 NULL
## 2: 1 1 1 1 3 37 5 NULL
As you can see, the key does not have to be unique (unlike row names in R, which must be unique). Subsetting by a key column in data.table is much faster than subsetting via other methods. The "Data processing with data.table" chapter of the Efficient R Programming book shows that subsetting a data.table by key is more than 60x faster than base R's data.frame and more than 40x faster than dplyr. The data.table website also has updated speed comparisons of data.table to pandas and dplyr, now including Spark and pydatatable, so you can get a sense of how different operations perform across these packages. The main takeaway is that data.table is fast: if you have large datasets, you shouldn't be using data.frame and base R functions for subsetting, grouping, or summarization.
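Keys can also span multiple columns, in which case the values inside .() match the key columns in order. A short sketch, keying on state and then city:

setkey(scores, STABBR, CITY)         # two-column key
key(scores)                          # inspect the current key
scores[.("NC", "Chapel Hill"), 1:4]  # rows with STABBR=="NC" and CITY=="Chapel Hill"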
We can also put functions inside the square brackets. First we convert TUITFTE to numeric, which gives a warning about NAs introduced by coercion:
scores$TUITFTE <- as.numeric(scores$TUITFTE)
## Warning: NAs introduced by coercion
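As an aside, the idiomatic data.table way to modify a column is the := operator, which updates the table in place (by reference) rather than copying; the same coercion warning appears:

scores[, TUITFTE := as.numeric(TUITFTE)]  # in-place version of the conversion above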
Here is a trivial example: calculating the mean tuition per FTE (full-time equivalent) student.
scores[,mean(TUITFTE,na.rm=TRUE)]
## [1] 8392.371
To make this a little easier to read, let’s define our own functions:
mean2 <- function(x) mean(x, na.rm=TRUE)
q25 <- function(x) quantile(x, .25, na.rm=TRUE)
q50 <- function(x) quantile(x, .50, na.rm=TRUE)
q75 <- function(x) quantile(x, .75, na.rm=TRUE)
Now again. This example is trivial, as we could just as well have computed the function after extracting the column:
scores[,mean2(TUITFTE)]
## [1] 8392.371
mean2(scores$TUITFTE)
## [1] 8392.371
The power of putting the function inside the square brackets is that it can be combined easily with subsetting and grouping operations. For example:
scores[CONTROL==1,mean2(TUITFTE)]
## [1] 3482.954
Or with a grouping operation:
scores[,mean2(TUITFTE),by=CONTROL]
## CONTROL V1
## 1: 3 8899.570
## 2: 1 3482.954
## 3: 2 12981.861
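The subsetting and grouping arguments compose, so we can restrict rows and group in a single call, e.g. (hypothetically) within North Carolina only:

scores[STABBR == "NC", mean2(TUITFTE), by=CONTROL]  # subset in i, summarize in j, group with by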
We can also compute multiple functions of one or more columns at the same time, here the median and quartiles of TUITFTE. We use the .() operator, which is synonymous with list():
scores[,.(median=q50(TUITFTE),q25=q25(TUITFTE),q75=q75(TUITFTE)),by=CONTROL]
## CONTROL median q25 q75
## 1: 3 7621.5 4823.25 11503.25
## 2: 1 2549.0 1381.00 4635.00
## 3: 2 11631.5 7775.00 16654.25
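We can also mix .N into the same .() list to report group sizes alongside the summaries, and these grouped tables feed directly into plotting code such as the ggplot2 call below:

scores[, .(n=.N, median=q50(TUITFTE)), by=CONTROL]  # counts and medians per CONTROL group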
library(ggplot2)
dat <- scores[,.(median=q50(TUITFTE),q25=q25(TUITFTE),q75=q75(TUITFTE)),by=CONTROL]
ggplot(dat, aes(CONTROL, median, ymin=q25, ymax=q75)) + geom_pointrange() +
xlab("category") + ylab("TUITION / FTE")
Again, there is more complex functionality available in data.table, which can be explored at the data.table website, but the operations above cover some of the most common use cases for reading, subsetting, and computing on a large tabular dataset.