Why put your code into a package?

There are two compelling reasons to move your code from a set of scripts to a package:

  • Versioning
  • Documentation

During a project, you may begin to accumulate a large code base, you can get by organizing functions into files, and using source, for example:

# hypothetical example:
load("data.rda")
source("normalize_functions.R")
source("EM_functions.R")
source("plot_functions.R")

This may be sufficient for a while, but it is not sufficient for sharing the code with others, either for reproducibility or for having others use your methods. Using source as above doesn’t let you or someone else know which version of the functions are being used. The way to do this in R is to put the R scripts into a package structure which has a version attached to it. The version is simply recorded as a line in a text file, which we will describe in these notes.

A potential user also cannot find out about the purpose of the functions, details about the arguments or output of the function if they are simply brought into R with source. You could provide comments in the R files, but these comments are not accessible from the command line while a potential user is about to use your function. Help for R functions is provided in a specifically formatted file (ending with .Rd) that goes into the man directory of an R package. I strongly recommend to not write these files by hand but instead use in-line documentation called roxygen and help functions from the devtools package to automatically create these help files. This will be covered in detail in another lecture note.

Two additional references for building R packages are:

  • R Packages by Hadley Wickham (currently out of date with respect to the addition of the usethis package)
  • Writing R Extensions the official guide from CRAN on how to create an R package

Minimal package skeleton

The bare minimum for an R package is a directory with the following:

  • a DESCRIPTION text file, which gives information about the package
  • a NAMESPACE text file, which tells which functions are imported or exported by the package (best written automatically, not manually)
  • a directory called R with the R scripts inside

We can get started by using the create_package() function from the usethis package. Suppose we have an R script with the following function:

add <- function(x,y,negative=FALSE) {
  z <- x + y
  if (negative) {
    z <- -1 * z
  }
  z
}

We will assume that we are running the following code on the class virtual machine, which uses R version 4.1.2. Let’s say we will create an R package called foo. We can start with the following:

library(usethis)
create_package("foo", roxygen=FALSE)

This will make a directory called foo in you current working directory, and will then will open a new Rstudio instance where the working directory to foo. For now, we set Roxygen equal to FALSE since we have not covered Roxygen yet. This option allows for automatic exporting of functions to the namespace when FALSE. If set to TRUE (default), the function assumes you will be using Roxygen to document functions and will export functions in accordance to your Roxygen documentation, leaving the NAMESPACE file blank upon creation.

In your current working directory, we can print the contents of the foo directory

list.files("foo")

You should see the following:

[1] "DESCRIPTION" "foo.Rproj"   "NAMESPACE"   "R"  

It will start with a DESCRIPTION file that looks like:

Package: foo
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R: 
    person("First", "Last", , "first.last@example.com", role = c("aut", "cre"),
           comment = c(ORCID = "YOUR-ORCID-ID"))
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
    license
Encoding: UTF-8

The DESCRIPTION file has a number of fields that we need to populate. We may start editing this file:

Package: foo
Title: Functions for doing great things
Version: 0.0.1
Authors@R: person("Jane","Doe",email="jdoe@uni.edu",role=c("aut","cre"))
Description: Contains amazing functions for doing amazing things.
License: GPL-2
Encoding: UTF-8

There are a number of licenses that are possible and these are important to read about and consider before releasing code. Here is a list of licenses that are in use for R:

https://www.r-project.org/Licenses/

R itself is licensed under GPL-2 | GPL-3.

If we simply move the R script above into the R directory, we are done with the bare minimum for our first R package, because the NAMESPACE file already says to export all functions that do not begin with a period:

exportPattern("^[^\\.]")

A complication is if we want to use functions from another package in our package. The way to do this is to import or depend on other packages. The easiest way to handle importing functions from another package is through the roxygen documentation format, so we will hold off on discussing this in the next lecture note. We have some description of the different kinds of package dependencies at the bottom of this lecture note.

We can build a shareable package “tarball” using the build function:

library(devtools)
build("foo")

Alternatively, this can be done on the command line, from one directory above the package directory, with R CMD build foo.

The build command prints out the following message and returns the location of the package tarball.

√  checking for file 'C:\statcomp_src\foo/DESCRIPTION' (2.3s)
-  preparing 'foo':
√  checking DESCRIPTION meta-information
-  checking for LF line-endings in source and make files and shell scripts
-  checking for empty or unneeded directories
-  building 'foo_0.0.1.tar.gz'
   
[1] "C:/statcomp_src/foo_0.0.1.tar.gz"

Now if we want to share our function add with a collaborator, we can send them the file foo_0.0.1.tar.gz.

Version numbers

The most important thing about version numbers is that they are free. You should “bump” the version number any time you make a change that you will push out the world (e.g. anytime you push your changes to GitHub or other repository). Even if you are working entirely by yourself, it is useful to be able to pinpoint differences in code output based on the version string (which will appear if you append the session information to the output of every script or Rmarkdown file you run).

There are three parts to most R package version numbers:

x.y.z

Roughly:

  • The “x” part is for major releases
  • The “y” part is for minor releases
  • The “z” part is for any change to the package

In Bioconductor, there is some additional information and restrictions on these numbers. “y” is even for release versions and odd for development versions. They have additional information on the system for Bioconductor package version numbers.

The first time a package is started, you may choose 0.0.1 and then add 1 to the “z” digit every time you make a change to the package code. It would be typical to use 1.0.0 when you first “release” the package, which you would do yourself if you are planning to host on GitHub or on CRAN. If you still consider the package a working “beta”, then you might only use 0.1.0, until you feel it is ready for 1.0.0. For Bioconductor packages, you would submit 0.99.z to the Bioconductor repository, and the Bioconductor machines would increment to 1.0.0 on your behalf for the first release. There is some additional information about version numbers and releasing a package from Hadley Wickham.

In R, package versions are a special class and you can compare package versions like so:

packageVersion("stats")
## [1] '4.1.2'
packageVersion("stats") >= "3.0.0"
## [1] TRUE

If you want a R script to produce an error if the package is out of data you can use:

stopifnot(packageVersion("stats") >= "3.0.0")

For an R package, if you need a specific version of another package, you should include a string in parentheses after its name in the Depends, Imports or Suggests field, for example:

Imports: foo (>= 1.2.0)

We will discuss these fields a bit more in the documentation notes.

Loading and sharing packages

By far the easiest way to load the package into R while you are developing it is to use the load_all function from the devtools package. This does not require the creation of the package tarball with the .tar.gz ending, but simply mimics an installation of the package. From within the package you can simply call load_all(), or you can also specify the path to the package:

load_all("/path/to/foo")

You also bypass having to call library(foo) as the package will already be loaded.

Another way to load the package is the standard install.packages function but specifying that the package tarball is a local file, not a name of a package in a remote repository. This then requires an explicit library(foo) call afterward.

install.packages("foo_0.0.1.tar.gz", repos=NULL)

You may need to restart your R session, or try this is another R session if you already have loaded the foo package in certain versions of R. Another easy way to share your package is to put all of the files into a GitHub repository. Then others can install the package on their machines simply with install_github("username/foo") using the devtools package. Again, this requires a library(foo) call afterward to load the package.

We still haven’t written any documentation, so when we try to load help we are told it is missing:

library(foo)
help(topic="add", package="foo")

Either through using load_all or installing and then loading with library, we can now use our function:

add(3,4)
## [1] 7
add(3,4,negative=TRUE)
## [1] -7

And for reproducibility sake, we can ask about the package version:

packageVersion("foo")
## [1] '0.0.1'

And we can include the following at the end of scripts to include all package versions (and other important details):

library(devtools)
## Loading required package: usethis
session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       macOS Big Sur 10.16
##  system   x86_64, darwin17.0
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/New_York
##  date     2023-01-12
##  pandoc   2.14.0.3 @ /Applications/RStudio.app/Contents/MacOS/pandoc/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.0)
##  cachem        1.0.6   2021-08-19 [1] CRAN (R 4.1.0)
##  callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.0)
##  cli           3.1.0   2021-10-27 [1] CRAN (R 4.1.0)
##  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.0)
##  desc          1.4.0   2021-09-28 [1] CRAN (R 4.1.0)
##  devtools    * 2.4.3   2021-11-30 [1] CRAN (R 4.1.0)
##  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.0)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
##  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
##  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.0)
##  glue          1.5.1   2021-11-30 [1] CRAN (R 4.1.0)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.0)
##  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
##  knitr         1.36    2021-09-29 [1] CRAN (R 4.1.0)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
##  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.1.0)
##  pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.1.0)
##  pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.0)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.1.0)
##  processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.0)
##  ps            1.6.0   2021-02-28 [1] CRAN (R 4.1.0)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.0)
##  rlang         1.0.5   2022-08-31 [1] CRAN (R 4.1.2)
##  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.0)
##  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.1.0)
##  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
##  sass          0.4.1   2022-03-23 [1] CRAN (R 4.1.2)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.0)
##  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.0)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
##  testthat      3.1.1   2021-12-03 [1] CRAN (R 4.1.0)
##  usethis     * 2.1.5   2021-12-09 [1] CRAN (R 4.1.0)
##  withr         2.4.3   2021-11-30 [1] CRAN (R 4.1.0)
##  xfun          0.28    2021-11-04 [1] CRAN (R 4.1.0)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Types of package dependencies

There are three main types of formal package dependencies in R:

  • Depends
  • Imports
  • Suggests

Each of these can be listed in the DESCRIPTION file. We will show in the documentation lecture note how to import specific functions from other packages, and how to add these packages to the Imports: line in the DESCRIPTION file. I recommend to import specific functions from other packages, rather than the entire package, but it is also possible to import the entire package. Note that the packages that you import functions from are not attached to the search path when you load your package with library(foo). This means that, even if you import one or more functions from another package within your package, those functions are not available to the user unless they also load the package with another call to library.

The Suggests: field of the DESCRIPTION file is for packages which are not absolutely required for your package to work, but which your package can make use of, or which are used in examples or the vignette of your package. I recommend to list a package under Suggests also in the case that you have a function which can make use of another package, but has a fallback implementation which does not require the other package. Or if there is no fallback, the function should explain that it cannot be run because the other package is not available. This testing to see if the other package is available can be performed using requireNamespace("foo", quiet=TRUE), which will return TRUE if the package is installed and FALSE if not. You can then use the function from the other package with :: within your package, for example: cool::fancy can be used to call the fancy function from the cool package.

This leaves Depends to be explained. If you read Hadley Wickham’s R Package reference and particular the section on Namespace, you’ll see that he does not recommend the use of Depends at all. I agree with his arguments laid out there and simply quote the key sentences here:

The main difference is that where Imports just loads the package, Depends attaches it. There are no other differences. … Unless there is a good reason otherwise, you should always list packages in Imports not Depends. That’s because a good package is self-contained, and minimises changes to the global environment (including the search path).

Package checking

R has a very useful, but also very picky checking software associated with it. It can be run with the check function from the devtools package, or by running R CMD check foo_0.0.1.tar.gz from the command line. Passing all of the R package checks is a good idea if you plan to share your code with others. In addition, there may be other checks that you will need to pass if you want to submit to Bioconductor. We won’t cover R package checking, as we only have limited time, but the reference on R Packages from Hadley Wickham discusses this.