Simple EDA using dplyr and ggplot2

Author

In this document, we will explore a small table of data about the genomes of various organisms. I also include a file, scrape_ensembl.R which shows how I scraped this data from the Ensembl website.

We start by reading in the TSV (tab-separated value) data using the readr package:

library(readr)
library(here)
orgs <- read_tsv(here("eda","organisms_genes.tsv"))
head(orgs)

# A tibble: 6 × 8
  type   name                species                        bp coding noncoding pseudo    tx
  <chr>  <chr>               <chr>                       <dbl>  <dbl>     <dbl>  <dbl> <dbl>
1 mammal Alpaca              Vicugna_pacos          2967746133  11765      2532    898 15236
2 fish   Amazon molly        Poecilia_formosa        748923461  23615       679     60 31637
3 invert Roundword           Caenorhabditis_elegans  100286401  20362     24719   1658 58941
4 invert Sea squirt          Ciona_intestinalis      115227500  16671       455     27 17784
5 invert Solitary sea squirt Ciona_savignyi          177003750  11616       340    216 20711
6 mammal Cat                 Felis_catus            2455541136  19493      1855    542 22656

table(orgs$type)


  bird   fish invert mammal  plant single 
     4      8      7     14      4      1

read_tsv returns a tibble object, which is similar to a data.frame, being tabular data with different types of variables as columns (character, double, integer, etc.). One difference, aside from the presentation when you show the object, is that the characters are not automatically converted to factor, as they are with read.delim or the like.

Let’s convert the first column to a factor. To do so, I show a special function from the magrittr package called a compound assignment pipe. This looks like %<>%. This take the left-hand side and performs the function on the right-hand side, finally re-assigning the end result to the left-hand side.

We will see other “pipes” from magrittr later on.

library(magrittr)
orgs$type %<>% factor

Let’s make some plots of the number of basepairs in the genomes of these organisms, split by the type of organism. There’s one plant with a really large genome. We could use the stripplot function, which is in the lattice package: this draws points for each level of a factor, similar to a boxplot, but is preferred when there are not many observations (e.g. < 20).

library(lattice)
with(orgs, stripplot(bp ~ type))

with(orgs, stripplot(log10(bp) ~ type))

Compare to boxplot. Again, given the small number of samples, stripplot is preferred.

with(orgs, boxplot(bp ~ type))

If you haven’t seen with(...) before, it’s a way to temporarily use the variables (columns) of a data.frame as if they were variables in the environment, that is, without using the $. As we will see below, ggplot2 also offers such a convenient syntax for working with column names.

Most of the organisms have around 1 billion basepairs in their genome.

That large genome is wheat.

orgs[which.max(orgs$bp),]

# A tibble: 1 × 8
  type  name  species                    bp coding noncoding pseudo     tx
  <fct> <chr> <chr>                   <dbl>  <dbl>     <dbl>  <dbl>  <dbl>
1 plant Wheat Triticum_aestivum 13427354022 104272     10156     NA 166589

Let’s remove it for now to make plots where we can see the range of the other organisms.

orgs <- orgs[orgs$name != "Wheat",]

I plot the number of genes that code for proteins over the total number of basepairs:

with(orgs, plot(bp, coding, col=type, pch=20, cex=2))
legend("topright", levels(orgs$type), col=1:nlevels(orgs$type), pch=20)

# right click to Esc from identify()
# with(orgs, identify(bp, coding, labels=orgs$name))

Another way to make a plot with labels is using ggplot2:

library(ggplot2)
ggplot(orgs, aes(bp, coding, color=type)) +
  geom_point(size=3)

Which organisms have the largest genome (after wheat)?

orgs[order(orgs$bp, decreasing=TRUE)[1:5],c("name","bp")]

# A tibble: 5 × 2
  name               bp
  <chr>           <dbl>
1 Orangutan  3446771396
2 Chimpanzee 3309577922
3 Elephant   3196760833
4 Human      3096649726
5 Gorilla    3040677044

Another way to get quick summaries of a dataset is skim from the skimr package:

library(skimr)


Attaching package: 'skimr'

The following object is masked from 'package:testthat':

    matches

skim(orgs)

Data summary
Name	orgs
Number of rows	37
Number of columns	8
_______________________
Column type frequency:
character	2
factor	1
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	3	19	0	37	0
species	0	1	8	24	0	37	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
type	0	1	FALSE	6	mam: 14, fis: 8, inv: 7, bir: 4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bp	0	1.00	1.574812e+09	1.111087e+09	12157105	729664433.00	1233186341	2.521924e+09	3446771396	▆▇▁▆▅
coding	0	1.00	1.981016e+04	5.999450e+03	6692	17488.00	20033	2.143700e+04	39498	▂▅▇▁▁
noncoding	0	1.00	5.999380e+03	1.008250e+04	340	1042.00	2644	6.492000e+03	55401	▇▁▁▁▁
pseudo	3	0.92	1.276790e+03	3.056010e+03	21	69.25	382	8.727500e+02	14589	▇▁▁▁▁
tx	0	1.00	4.099754e+04	3.876149e+04	7126	21326.00	29196	3.811800e+04	199234	▇▁▁▁▁

For many operations, it’s useful to use the dplyr package to manipulate data.frames or tibbles.

Before, I introduce two new functions. First, the forward pipe %>%, which sends the left-hand side to the first argument of the function on the right-hand side. Seconds, the right assignment -> is just like the more familiar left assignment, but points in the same direction of the pipes, so I find useful here.

A simple example of the pipe is:

x <- c(1:9,1000)
x %>% mean(trim=.1)

[1] 5.5

# equivalent to
mean(x, trim=.1)

[1] 5.5

Using dplyr, let’s show, adding a new column which tallies up the number of coding and non-coding genes:

library(dplyr)
orgs %<>% mutate(genes = coding + noncoding)
orgs$genes

 [1] 14297 24294 45081 17126 11956 21348 25250 24838 27440 21636 23819 31754 20340 16201 22677 17302
[17] 19226 27663 22591 42839 13067 36914 27420 24754 20751 29229  7105 22404 27063 14878 19231 19748
[33] 31892 18212 29053 44474 91080

dplyr is really useful for grouping and summarization tasks. The following line groups the organisms by type and summarizes the number for each type, the mean ratio of genes which are coding and the standard deviation of the ratio. One type has an SD of NA, why?

orgs %>%
  group_by(type) %>%
  summarize(n=n(),
            cr_avg=mean(coding/genes),
            cr_sd=sd(coding/genes))

# A tibble: 6 × 4
  type       n cr_avg   cr_sd
  <fct>  <int>  <dbl>   <dbl>
1 bird       4  0.903  0.110 
2 fish       8  0.911  0.0722
3 invert     7  0.840  0.187 
4 mammal    14  0.764  0.128 
5 plant      3  0.744  0.307 
6 single     1  0.942 NA

Let’s save this as a new table, tab:

tab <- orgs %>%
  group_by(type) %>%
  summarize(n=n(),
            cr_avg=mean(coding/genes),
            cr_sd=sd(coding/genes))

Two packages with functions for printing tables (as Latex or markdown):

library(xtable)
xtable(tab)

% latex table generated in R 4.4.1 by xtable 1.8-4 package
% Wed Aug 21 10:56:50 2024
\begin{table}[ht]
\centering
\begin{tabular}{rlrrr}
  \hline
 & type & n & cr\_avg & cr\_sd \\ 
  \hline
1 & bird &   4 & 0.90 & 0.11 \\ 
  2 & fish &   8 & 0.91 & 0.07 \\ 
  3 & invert &   7 & 0.84 & 0.19 \\ 
  4 & mammal &  14 & 0.76 & 0.13 \\ 
  5 & plant &   3 & 0.74 & 0.31 \\ 
  6 & single &   1 & 0.94 &  \\ 
   \hline
\end{tabular}
\end{table}

library(knitr)
kable(tab)

type	n	cr_avg	cr_sd
bird	4	0.9032821	0.1099685
fish	8	0.9111016	0.0721778
invert	7	0.8396879	0.1866490
mammal	14	0.7637233	0.1281888
plant	3	0.7439093	0.3066560
single	1	0.9418719	NA

We can use summarization as a mid-point and then send those summaries to a plotting function. For example, points and line ranges:

# here we have to name the median something other than `bp`
orgs %>%
  group_by(type) %>%
  summarize(basepairs=median(bp),
            min=min(bp),
            max=max(bp)) %>%
  ggplot(aes(type, basepairs, ymin=min, ymax=max)) +
  geom_pointrange()

ggplot2 is great for grouping points together with summarizing graphical elements, such as lines through sets of points:

ggplot(orgs, aes(bp, coding, col=type)) +
  geom_point() + 
  geom_smooth(se=FALSE, method="lm", show.legend=FALSE)

`geom_smooth()` using formula = 'y ~ x'

Also for “faceting” plots, breaking into multiple scatter plots by breaking along a variable:

ggplot(orgs, aes(bp, coding)) +
  geom_point() + 
  facet_wrap(~ type)

Finally, I like to show a useful plot constructed using ggplot2, just with simulated data. It is common that wants to show overlapping densities or histograms, which is not a simple task with base R graphics. It’s fairly simple with ggplot2.

We create some simulated data drawing three sets of samples from three distributions.

ns <- c(1000,500,200)
dat <- data.frame(type=rep(letters[1:3], ns),
                  z=rnorm(1700,
                          mean=rep(c(0,2,4),ns),
                          sd=rep(c(1,1,1),ns)))
head(dat)

  type          z
1    a  0.5076440
2    a  0.3512353
3    a -1.2391776
4    a -0.7065017
5    a  1.2554839
6    a  0.2416517

table(dat$type)


   a    b    c 
1000  500  200

It’s straightforward to draw these as overlapping densities:

ggplot(dat, aes(z, col=type, fill=type)) + geom_density(alpha=0.1)

Or as histograms:

ggplot(dat, aes(z, col=type, fill=type)) +
  geom_histogram(alpha=0.1, position="identity")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another option, putting the bars beside each other:

ggplot(dat, aes(z, col=type, fill=type)) +
  geom_histogram(position="dodge")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.