In this document, we will explore a small table of data about the
genomes of various organisms. I also include a file,
scrape_ensembl.R
which shows how I scraped this data from
the Ensembl website.
We start by reading in the TSV (tab-separated value) data using the readr package:
library(readr)
library(here)
orgs <- read_tsv(here("eda","organisms_genes.tsv"))
head(orgs)
## # A tibble: 6 × 8
## type name species bp coding noncoding pseudo tx
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mammal Alpaca Vicugna_pacos 2.97e9 11765 2532 898 15236
## 2 fish Amazon molly Poecilia_formosa 7.49e8 23615 679 60 31637
## 3 invert Roundword Caenorhabditis_elegans 1.00e8 20362 24719 1658 58941
## 4 invert Sea squirt Ciona_intestinalis 1.15e8 16671 455 27 17784
## 5 invert Solitary sea squirt Ciona_savignyi 1.77e8 11616 340 216 20711
## 6 mammal Cat Felis_catus 2.46e9 19493 1855 542 22656
table(orgs$type)
##
## bird fish invert mammal plant single
## 4 8 7 14 4 1
read_tsv
returns a tibble object, which is
similar to a data.frame, being tabular data with different
types of variables as columns (character, double, integer, etc.). One
difference, aside from the presentation when you show
the
object, is that the characters are not automatically converted to
factor, as they are with read.delim
or the
like.
Let’s convert the first column to a factor. To do so, I show
a special function from the magrittr package called a compound
assignment pipe. This looks like %<>%
. This take the
left-hand side and performs the function on the right-hand side, finally
re-assigning the end result to the left-hand side.
We will see other “pipes” from magrittr later on.
library(magrittr)
orgs$type %<>% factor
Let’s make some plots of the number of basepairs in the genomes of
these organisms, split by the type of organism. There’s one plant with a
really large genome. We could use the stripplot
function,
which is in the lattice
package: this draws points for each
level of a factor, similar to a boxplot, but is preferred when there are
not many observations (e.g. < 20).
library(lattice)
with(orgs, stripplot(bp ~ type))
with(orgs, stripplot(log10(bp) ~ type))
Compare to boxplot. Again, given the small number of samples,
stripplot
is preferred.
with(orgs, boxplot(bp ~ type))
If you haven’t seen with(...)
before, it’s a way to
temporarily use the variables (columns) of a data.frame as if
they were variables in the environment, that is, without using the
$
. As we will see below, ggplot2
also offers
such a convenient syntax for working with column names.
Most of the organisms have around 1 billion basepairs in their genome.
That large genome is wheat.
orgs[which.max(orgs$bp),]
## # A tibble: 1 × 8
## type name species bp coding noncoding pseudo tx
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 plant Wheat Triticum_aestivum 13427354022 104272 10156 NA 166589
Let’s remove it for now to make plots where we can see the range of the other organisms.
orgs <- orgs[orgs$name != "Wheat",]
I plot the number of genes that code for proteins over the total number of basepairs:
with(orgs, plot(bp, coding, col=type, pch=20, cex=2))
legend("topright", levels(orgs$type), col=1:nlevels(orgs$type), pch=20)
# right click to Esc from identify()
# with(orgs, identify(bp, coding, labels=orgs$name))
Another way to make a plot with labels is using ggplot2:
library(ggplot2)
ggplot(orgs, aes(bp, coding, color=type)) +
geom_point(size=3)
Which organisms have the largest genome (after wheat)?
orgs[order(orgs$bp, decreasing=TRUE)[1:5],c("name","bp")]
## # A tibble: 5 × 2
## name bp
## <chr> <dbl>
## 1 Orangutan 3446771396
## 2 Chimpanzee 3309577922
## 3 Elephant 3196760833
## 4 Human 3096649726
## 5 Gorilla 3040677044
Another way to get quick summaries of a dataset is skim
from the skimr package:
library(skimr)
skim(orgs)
Name | orgs |
Number of rows | 37 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 2 |
factor | 1 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1 | 3 | 19 | 0 | 37 | 0 |
species | 0 | 1 | 8 | 24 | 0 | 37 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
type | 0 | 1 | FALSE | 6 | mam: 14, fis: 8, inv: 7, bir: 4 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bp | 0 | 1.00 | 1.574812e+09 | 1.111087e+09 | 12157105 | 729664433.00 | 1233186341 | 2.521924e+09 | 3446771396 | ▆▇▁▆▅ |
coding | 0 | 1.00 | 1.981016e+04 | 5.999450e+03 | 6692 | 17488.00 | 20033 | 2.143700e+04 | 39498 | ▂▅▇▁▁ |
noncoding | 0 | 1.00 | 5.999380e+03 | 1.008250e+04 | 340 | 1042.00 | 2644 | 6.492000e+03 | 55401 | ▇▁▁▁▁ |
pseudo | 3 | 0.92 | 1.276790e+03 | 3.056010e+03 | 21 | 69.25 | 382 | 8.727500e+02 | 14589 | ▇▁▁▁▁ |
tx | 0 | 1.00 | 4.099754e+04 | 3.876149e+04 | 7126 | 21326.00 | 29196 | 3.811800e+04 | 199234 | ▇▁▁▁▁ |
For many operations, it’s useful to use the dplyr package to manipulate data.frames or tibbles.
Before, I introduce two new functions. First, the forward pipe
%>%
, which sends the left-hand side to the first
argument of the function on the right-hand side. Seconds, the right
assignment ->
is just like the more familiar left
assignment, but points in the same direction of the pipes, so I find
useful here.
A simple example of the pipe is:
x <- c(1:9,1000)
x %>% mean(trim=.1)
## [1] 5.5
# equivalent to
mean(x, trim=.1)
## [1] 5.5
Using dplyr, let’s show, adding a new column which tallies up the number of coding and non-coding genes:
library(dplyr)
orgs %<>% mutate(genes = coding + noncoding)
orgs$genes
## [1] 14297 24294 45081 17126 11956 21348 25250 24838 27440 21636 23819 31754 20340 16201
## [15] 22677 17302 19226 27663 22591 42839 13067 36914 27420 24754 20751 29229 7105 22404
## [29] 27063 14878 19231 19748 31892 18212 29053 44474 91080
dplyr is really useful for grouping and summarization tasks. The
following line groups the organisms by type and summarizes the number
for each type, the mean ratio of genes which are coding and the standard
deviation of the ratio. One type
has an SD of NA, why?
orgs %>%
group_by(type) %>%
summarize(n=n(),
cr_avg=mean(coding/genes),
cr_sd=sd(coding/genes))
## # A tibble: 6 × 4
## type n cr_avg cr_sd
## <fct> <int> <dbl> <dbl>
## 1 bird 4 0.903 0.110
## 2 fish 8 0.911 0.0722
## 3 invert 7 0.840 0.187
## 4 mammal 14 0.764 0.128
## 5 plant 3 0.744 0.307
## 6 single 1 0.942 NA
Let’s save this as a new table, tab
:
tab <- orgs %>%
group_by(type) %>%
summarize(n=n(),
cr_avg=mean(coding/genes),
cr_sd=sd(coding/genes))
Two packages with functions for printing tables (as Latex or markdown):
library(xtable)
xtable(tab)
## % latex table generated in R 4.2.1 by xtable 1.8-4 package
## % Wed Aug 17 21:20:25 2022
## \begin{table}[ht]
## \centering
## \begin{tabular}{rlrrr}
## \hline
## & type & n & cr\_avg & cr\_sd \\
## \hline
## 1 & bird & 4 & 0.90 & 0.11 \\
## 2 & fish & 8 & 0.91 & 0.07 \\
## 3 & invert & 7 & 0.84 & 0.19 \\
## 4 & mammal & 14 & 0.76 & 0.13 \\
## 5 & plant & 3 & 0.74 & 0.31 \\
## 6 & single & 1 & 0.94 & \\
## \hline
## \end{tabular}
## \end{table}
library(knitr)
kable(tab)
type | n | cr_avg | cr_sd |
---|---|---|---|
bird | 4 | 0.9032821 | 0.1099685 |
fish | 8 | 0.9111016 | 0.0721778 |
invert | 7 | 0.8396879 | 0.1866490 |
mammal | 14 | 0.7637233 | 0.1281888 |
plant | 3 | 0.7439093 | 0.3066560 |
single | 1 | 0.9418719 | NA |
We can use summarization as a mid-point and then send those summaries to a plotting function. For example, points and line ranges:
# here we have to name the median something other than `bp`
orgs %>%
group_by(type) %>%
summarize(basepairs=median(bp),
min=min(bp),
max=max(bp)) %>%
ggplot(aes(type, basepairs, ymin=min, ymax=max)) +
geom_pointrange()
ggplot2 is great for grouping points together with summarizing graphical elements, such as lines through sets of points:
ggplot(orgs, aes(bp, coding, col=type)) +
geom_point() +
geom_smooth(se=FALSE, method="lm", show.legend=FALSE)
## `geom_smooth()` using formula 'y ~ x'
Also for “faceting” plots, breaking into multiple scatter plots by breaking along a variable:
ggplot(orgs, aes(bp, coding)) +
geom_point() +
facet_wrap(~ type)
Finally, I like to show a useful plot constructed using ggplot2, just with simulated data. It is common that wants to show overlapping densities or histograms, which is not a simple task with base R graphics. It’s fairly simple with ggplot2.
We create some simulated data drawing three sets of samples from three distributions.
ns <- c(1000,500,200)
dat <- data.frame(type=rep(letters[1:3], ns),
z=rnorm(1700,
mean=rep(c(0,2,4),ns),
sd=rep(c(1,1,1),ns)))
head(dat)
## type z
## 1 a 1.1498130
## 2 a -0.8524997
## 3 a 0.5411426
## 4 a 1.0387436
## 5 a 1.2145224
## 6 a 0.4237036
table(dat$type)
##
## a b c
## 1000 500 200
It’s straightforward to draw these as overlapping densities:
ggplot(dat, aes(z, col=type, fill=type)) + geom_density(alpha=0.1)
Or as histograms:
ggplot(dat, aes(z, col=type, fill=type)) +
geom_histogram(alpha=0.1, position="identity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Another option, putting the bars beside each other:
ggplot(dat, aes(z, col=type, fill=type)) +
geom_histogram(position="dodge")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.