In this document, we will explore a small table of data about the genomes of various organisms. I also include a file, scrape_ensembl.R which shows how I scraped this data from the Ensembl website.
We start by reading in the TSV (tab-separated value) data using the readr package:
read_tsv returns a tibble object, which is similar to a data.frame, being tabular data with different types of variables as columns (character, double, integer, etc.). One difference, aside from the presentation when you show the object, is that the characters are not automatically converted to factor, as they are with read.delim or the like.
Let’s convert the first column to a factor. To do so, I show a special function from the magrittr package called a compound assignment pipe. This looks like %<>%. This take the left-hand side and performs the function on the right-hand side, finally re-assigning the end result to the left-hand side.
We will see other “pipes” from magrittr later on.
library(magrittr)orgs$type %<>% factor
Let’s make some plots of the number of basepairs in the genomes of these organisms, split by the type of organism. There’s one plant with a really large genome. We could use the stripplot function, which is in the lattice package: this draws points for each level of a factor, similar to a boxplot, but is preferred when there are not many observations (e.g. < 20).
library(lattice)with(orgs, stripplot(bp ~ type))
with(orgs, stripplot(log10(bp) ~ type))
Compare to boxplot. Again, given the small number of samples, stripplot is preferred.
with(orgs, boxplot(bp ~ type))
If you haven’t seen with(...) before, it’s a way to temporarily use the variables (columns) of a data.frame as if they were variables in the environment, that is, without using the $. As we will see below, ggplot2 also offers such a convenient syntax for working with column names.
Most of the organisms have around 1 billion basepairs in their genome.
That large genome is wheat.
orgs[which.max(orgs$bp),]
# A tibble: 1 × 8
type name species bp coding noncoding pseudo tx
<fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 plant Wheat Triticum_aestivum 13427354022 104272 10156 NA 166589
Let’s remove it for now to make plots where we can see the range of the other organisms.
orgs <- orgs[orgs$name !="Wheat",]
I plot the number of genes that code for proteins over the total number of basepairs:
# A tibble: 5 × 2
name bp
<chr> <dbl>
1 Orangutan 3446771396
2 Chimpanzee 3309577922
3 Elephant 3196760833
4 Human 3096649726
5 Gorilla 3040677044
Another way to get quick summaries of a dataset is skim from the skimr package:
library(skimr)
Attaching package: 'skimr'
The following object is masked from 'package:testthat':
matches
skim(orgs)
Data summary
Name
orgs
Number of rows
37
Number of columns
8
_______________________
Column type frequency:
character
2
factor
1
numeric
5
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
name
0
1
3
19
0
37
0
species
0
1
8
24
0
37
0
Variable type: factor
skim_variable
n_missing
complete_rate
ordered
n_unique
top_counts
type
0
1
FALSE
6
mam: 14, fis: 8, inv: 7, bir: 4
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
bp
0
1.00
1.574812e+09
1.111087e+09
12157105
729664433.00
1233186341
2.521924e+09
3446771396
▆▇▁▆▅
coding
0
1.00
1.981016e+04
5.999450e+03
6692
17488.00
20033
2.143700e+04
39498
▂▅▇▁▁
noncoding
0
1.00
5.999380e+03
1.008250e+04
340
1042.00
2644
6.492000e+03
55401
▇▁▁▁▁
pseudo
3
0.92
1.276790e+03
3.056010e+03
21
69.25
382
8.727500e+02
14589
▇▁▁▁▁
tx
0
1.00
4.099754e+04
3.876149e+04
7126
21326.00
29196
3.811800e+04
199234
▇▁▁▁▁
For many operations, it’s useful to use the dplyr package to manipulate data.frames or tibbles.
Before, I introduce two new functions. First, the forward pipe %>%, which sends the left-hand side to the first argument of the function on the right-hand side. Seconds, the right assignment -> is just like the more familiar left assignment, but points in the same direction of the pipes, so I find useful here.
A simple example of the pipe is:
x <-c(1:9,1000)x %>%mean(trim=.1)
[1] 5.5
# equivalent tomean(x, trim=.1)
[1] 5.5
Using dplyr, let’s show, adding a new column which tallies up the number of coding and non-coding genes:
dplyr is really useful for grouping and summarization tasks. The following line groups the organisms by type and summarizes the number for each type, the mean ratio of genes which are coding and the standard deviation of the ratio. One type has an SD of NA, why?
We can use summarization as a mid-point and then send those summaries to a plotting function. For example, points and line ranges:
# here we have to name the median something other than `bp`orgs %>%group_by(type) %>%summarize(basepairs=median(bp),min=min(bp),max=max(bp)) %>%ggplot(aes(type, basepairs, ymin=min, ymax=max)) +geom_pointrange()
ggplot2 is great for grouping points together with summarizing graphical elements, such as lines through sets of points:
Finally, I like to show a useful plot constructed using ggplot2, just with simulated data. It is common that wants to show overlapping densities or histograms, which is not a simple task with base R graphics. It’s fairly simple with ggplot2.
We create some simulated data drawing three sets of samples from three distributions.