In the previous lecture note, we introduced the data.table package and showed how it can be used to read in large datasets into R (so storing the dataset in memory), and then how specialized functions allow for fast subsetting and grouping/summarization operations. This works fairly well for many tasks on large tabular data until we hit the limit in terms of the size of dataset that can be read into memory. After we hit this memory limit, we can turn instead to on-disk storage of tables of data, and a convenient format for this is SQLite. A critical design point of SQLite (from the Wikipedia page):
SQLite stores the entire database (definitions, tables, indices, and the data itself) as a single cross-platform file on a host machine. It implements this simple design by locking the entire database file during writing. SQLite read operations can be multitasked, though writes can only be performed sequentially.
We will jump right in to trying out a connection to a SQLite database. We use the RSQLite package which provides an interface to the SQLite library, and the DBI package which provides a generic interface from R to various database backends. The following example is derived from the example code in ?SQLite
, which has additional information on working with RSQLite.
The following will connect to a database myDB.sqlite
and if it does not exist, it will create the file:
library(RSQLite)
library(DBI)
con <- dbConnect(SQLite(), "myDB.sqlite")
con
## <SQLiteConnection>
## Path: /home/love/teach/statcomp/statcomp_src/large/myDB.sqlite
## Extensions: TRUE
If we wanted to try out the RQLite package without writing a file to disk we could have also used ":memory:"
instead of writing a filename, which creates an in-memory database.
Let’s write a table from R to the database. Typically you would most likely just be reading very large databases from SQLite rather than writing tables, but we do so as an example anyway:
data(mtcars)
dbWriteTable(con, "cars", mtcars)
dbListTables(con)
## [1] "cars"
We can then pull rows of data from the table using standard SQL-style queries. If you’ve never performed SQL queries before, it’s pretty easy to learn by example, and w3schools has a reference for learning or reviewing if you haven’t seen this in a while.
The following pulls all rows from the cars
table:
rows <- dbGetQuery(con, "SELECT * FROM cars")
head(rows)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
nrow(rows)
## [1] 32
We can also select subsets of the data easily:
rows <- dbGetQuery(con, "SELECT * FROM cars WHERE cyl=6")
head(rows)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 5 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 6 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
nrow(rows)
## [1] 7
However, the whole motivation in this lecture note was that we potentially have more data than can fit in memory, and so we can also fetch data from the table in chunks. Here we formulate a query rs
, and then fetch 10 rows at a time with dbFetch
:
rs <- dbSendQuery(con, "SELECT * FROM cars")
d1 <- dbFetch(rs, n=10)
dbHasCompleted(rs)
## [1] FALSE
We can continue to fetch batches of 10 (or any number), or we can extract all remaining data by specifying -1:
d2 <- dbFetch(rs, n=-1)
dbHasCompleted(rs)
## [1] TRUE
dbClearResult(rs)
Finally, we close the connection when we are finished working with the database:
dbDisconnect(con)
This short lecture note was to give a brief overview to the use of RSQLite as an interface to SQLite on-disk databases. These are very powerful ways to work with large data when the datasets no longer fit into memory, or generally as a way to share datasets as a single file and in a format that is incredibly widely used and well tested. We do not teach SQL queries in this class, as these are fairly easy to learn on your own through reading over example queries or trying them out on example datasets as shown here.