Here we show off the way we specify genomic ranges in Bioconductor, using the GenomicRanges package and the GRanges class.

A GRanges object specifies both the basepairs involved in the range (these are integer ranges), the strand (+ or -), and the chromosome. Chromosomes in Bioconductor are called “sequences” (because sometimes we may want to specify ranges of non-chromosomes, for example viral DNA, un-incorporated contigs, etc.).

Interval ranges

Let’s start with a set of interval ranges, an IRanges object.

library(GenomicRanges)
ir <- IRanges(start=100 + 1:5, width=10)
ir
## IRanges object with 5 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       101       110        10
##   [2]       102       111        10
##   [3]       103       112        10
##   [4]       104       113        10
##   [5]       105       114        10

We could also specify the same ranges by setting the end:

ir <- IRanges(start=100 + 1:5, end=100 + 10:14)
ir
## IRanges object with 5 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       101       110        10
##   [2]       102       111        10
##   [3]       103       112        10
##   [4]       104       113        10
##   [5]       105       114        10

We can make a plot of these ranges. It’s a bit different than a typical plot, because I want to point out that we have discrete locations in the genome (so the position isn’t defined over \(\mathbb{R}\) but over \(\mathbb{N}\)). So I’ve shifted the positions of the labels by 0.5 (see the axis command). Another detail to notice is that the ranges are inclusive. So the first range, from 101 to 110 includes the 110 basepair. So I’ve added +1 when I draw the arrows to show it includes the last basepair.

(Note that the convention used in Bioconductor to define interval ranges is not universal. The BED convention is that the range does not include the start position of the range. So 101-110 in BED format would not include the 101th basepair.)

library(rafalib)
nullplot(start(range(ir)), end(range(ir))+1, 0, 6, xaxt="n")
abline(v=100:115, col="grey")
axis(1, at=100:115 + .5, labels=100:115) 
arrows(start(ir),seq_along(ir),end(ir)+1,seq_along(ir),
       lwd=3,code=3,angle=90,length=.05)

Genomic ranges

So far we’ve just shown interval ranges. To specify a genomic range, we also will come up with the strand of the ranges (+ or -) and the chromosome.

st <- rep(c("+","-"),c(3,2))
st
## [1] "+" "+" "+" "-" "-"
gr <- GRanges(seqnames="chr1", ranges=ir, strand=st)
gr
## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1   101-110      +
##   [2]     chr1   102-111      +
##   [3]     chr1   103-112      +
##   [4]     chr1   104-113      -
##   [5]     chr1   105-114      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that GRanges objects, when printed, also have a note about the genome. Here it says that we have an unspecified genome, and there is no information about the seqlengths, the lengths of the chromosomes. This information (basically in what universe do the ranges live) is important to track along with the ranges themselves, in case the user asks to, e.g. move all ranges 1 million basepairs to the right. This is not necessarily possible if we have a range that ends less than 1 million basepairs from the end of a chromosome. So the GRanges object really needs to also contain details about the genome itself.

Here we show some basic manipulations of GRanges:

gr + 10 # growing by 10 on either side
## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1    91-120      +
##   [2]     chr1    92-121      +
##   [3]     chr1    93-122      +
##   [4]     chr1    94-123      -
##   [5]     chr1    95-124      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
shift(gr, 10) # shifting by 10
## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1   111-120      +
##   [2]     chr1   112-121      +
##   [3]     chr1   113-122      +
##   [4]     chr1   114-123      -
##   [5]     chr1   115-124      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
resize(gr, width=1) # 1 bp from the start (+) or end (-)
## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       101      +
##   [2]     chr1       102      +
##   [3]     chr1       103      +
##   [4]     chr1       113      -
##   [5]     chr1       114      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
flank(gr, width=2, both=TRUE) # +/- 2 bp from " or "
## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1    99-102      +
##   [2]     chr1   100-103      +
##   [3]     chr1   101-104      +
##   [4]     chr1   112-115      -
##   [5]     chr1   113-116      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that the last two commands, resize and flank, gave back ranges relative to the start of the ranges with positive strand, and to the end of the ranges with negative strand. This is particular useful for working with genes, in which we often care about the region surrounding the transcription start site or TSS of a gene. The TSS will be the left-most basepair of a gene on the positive strand, and the right-most basepair of a gene on the negative strand.

These functions would usually give us more information, but in this case the information was not specified:

seqnames(gr)
## factor-Rle of length 5 with 1 run
##   Lengths:    5
##   Values : chr1
## Levels(1): chr1
seqinfo(gr)
## Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
##   seqnames seqlengths isCircular genome
##   chr1             NA         NA   <NA>

Instead of looking at a toy GRanges, let’s download some actual genomic ranges, the set of human genes, as annotated by the Ensembl project, version 86.

This data is easily accessible in Bioconductor using the ensembldb package, and the specific Ensembl database package from Bioconductor.

This says there are about 60 thousand genes (according to Ensembl).

library(ensembldb)
library(EnsDb.Hsapiens.v86) # this pkg is about 75 Mb
edb <- EnsDb.Hsapiens.v86
g <- genes(edb)
length(g)
## [1] 63970

If you try to print g it will just show you the beginning and end:

g
## GRanges object with 63970 ranges and 6 metadata columns:
##                   seqnames            ranges strand |         gene_id   gene_name
##                      <Rle>         <IRanges>  <Rle> |     <character> <character>
##   ENSG00000223972        1       11869-14409      + | ENSG00000223972     DDX11L1
##   ENSG00000227232        1       14404-29570      - | ENSG00000227232      WASH7P
##   ENSG00000278267        1       17369-17436      - | ENSG00000278267   MIR6859-1
##   ENSG00000243485        1       29554-31109      + | ENSG00000243485   MIR1302-2
##   ENSG00000237613        1       34554-36081      - | ENSG00000237613     FAM138A
##               ...      ...               ...    ... .             ...         ...
##   ENSG00000224240        Y 26549425-26549743      + | ENSG00000224240     CYCSP49
##   ENSG00000227629        Y 26586642-26591601      - | ENSG00000227629  SLC25A15P1
##   ENSG00000237917        Y 26594851-26634652      - | ENSG00000237917     PARP4P1
##   ENSG00000231514        Y 26626520-26627159      - | ENSG00000231514     FAM58CP
##   ENSG00000235857        Y 56855244-56855488      + | ENSG00000235857     CTBP2P1
##                             gene_biotype seq_coord_system      symbol
##                              <character>      <character> <character>
##   ENSG00000223972 transcribed_unproces..       chromosome     DDX11L1
##   ENSG00000227232 unprocessed_pseudogene       chromosome      WASH7P
##   ENSG00000278267                  miRNA       chromosome   MIR6859-1
##   ENSG00000243485                lincRNA       chromosome   MIR1302-2
##   ENSG00000237613                lincRNA       chromosome     FAM138A
##               ...                    ...              ...         ...
##   ENSG00000224240   processed_pseudogene       chromosome     CYCSP49
##   ENSG00000227629 unprocessed_pseudogene       chromosome  SLC25A15P1
##   ENSG00000237917 unprocessed_pseudogene       chromosome     PARP4P1
##   ENSG00000231514   processed_pseudogene       chromosome     FAM58CP
##   ENSG00000235857   processed_pseudogene       chromosome     CTBP2P1
##                                         entrezid
##                                           <list>
##   ENSG00000223972 100287596,100287102,727856,...
##   ENSG00000227232                           <NA>
##   ENSG00000278267                      102466751
##   ENSG00000243485            105376912,100302278
##   ENSG00000237613           654835,645520,641702
##               ...                            ...
##   ENSG00000224240                           <NA>
##   ENSG00000227629                           <NA>
##   ENSG00000237917                           <NA>
##   ENSG00000231514                           <NA>
##   ENSG00000235857                           <NA>
##   -------
##   seqinfo: 357 sequences (1 circular) from GRCh38 genome

You can take a look at a single gene:

g[1]
## GRanges object with 1 range and 6 metadata columns:
##                   seqnames      ranges strand |         gene_id   gene_name           gene_biotype
##                      <Rle>   <IRanges>  <Rle> |     <character> <character>            <character>
##   ENSG00000223972        1 11869-14409      + | ENSG00000223972     DDX11L1 transcribed_unproces..
##                   seq_coord_system      symbol                       entrezid
##                        <character> <character>                         <list>
##   ENSG00000223972       chromosome     DDX11L1 100287596,100287102,727856,...
##   -------
##   seqinfo: 357 sequences (1 circular) from GRCh38 genome

To pull out the gene names you use a $. This would give us a 60 thousand length character vector, so we just show the first 5.

g$gene_name[1:5]
## [1] "DDX11L1"   "WASH7P"    "MIR6859-1" "MIR1302-2" "FAM138A"

We can look at the distribution of genes by chromosome. The seqnames returns a factor-Rle where Rle stands for run length encoding. This is a way to compress a long vector which may contain repeated values in runs.

seqnames(g)
## factor-Rle of length 63970 with 357 runs
##   Lengths:                       5264                       2232 ...                        523
##   Values : 1                          10                         ...         Y                 
## Levels(357): 1 10 11 12 13 14 15 16 17 18 ... LRG_187 LRG_239 LRG_311 LRG_721 LRG_741 LRG_93 MT X Y

The object g has information about genes on the standard chromosomes, as well as genes or haplotypes of genes on haplotype chromosomes.

head(seqlevels(g),100)
##   [1] "1"                          "10"                         "11"                        
##   [4] "12"                         "13"                         "14"                        
##   [7] "15"                         "16"                         "17"                        
##  [10] "18"                         "19"                         "2"                         
##  [13] "20"                         "21"                         "22"                        
##  [16] "3"                          "4"                          "5"                         
##  [19] "6"                          "7"                          "8"                         
##  [22] "9"                          "CHR_HG107_PATCH"            "CHR_HG126_PATCH"           
##  [25] "CHR_HG1311_PATCH"           "CHR_HG1342_HG2282_PATCH"    "CHR_HG1362_PATCH"          
##  [28] "CHR_HG142_HG150_NOVEL_TEST" "CHR_HG151_NOVEL_TEST"       "CHR_HG1651_PATCH"          
##  [31] "CHR_HG1832_PATCH"           "CHR_HG2021_PATCH"           "CHR_HG2022_PATCH"          
##  [34] "CHR_HG2023_PATCH"           "CHR_HG2030_PATCH"           "CHR_HG2058_PATCH"          
##  [37] "CHR_HG2062_PATCH"           "CHR_HG2063_PATCH"           "CHR_HG2066_PATCH"          
##  [40] "CHR_HG2072_PATCH"           "CHR_HG2095_PATCH"           "CHR_HG2104_PATCH"          
##  [43] "CHR_HG2116_PATCH"           "CHR_HG2128_PATCH"           "CHR_HG2191_PATCH"          
##  [46] "CHR_HG2213_PATCH"           "CHR_HG2217_PATCH"           "CHR_HG2232_PATCH"          
##  [49] "CHR_HG2233_PATCH"           "CHR_HG2235_PATCH"           "CHR_HG2239_PATCH"          
##  [52] "CHR_HG2247_PATCH"           "CHR_HG2249_PATCH"           "CHR_HG2288_HG2289_PATCH"   
##  [55] "CHR_HG2290_PATCH"           "CHR_HG2291_PATCH"           "CHR_HG2334_PATCH"          
##  [58] "CHR_HG26_PATCH"             "CHR_HG986_PATCH"            "CHR_HSCHR10_1_CTG1"        
##  [61] "CHR_HSCHR10_1_CTG2"         "CHR_HSCHR10_1_CTG3"         "CHR_HSCHR10_1_CTG4"        
##  [64] "CHR_HSCHR10_1_CTG6"         "CHR_HSCHR11_1_CTG1_2"       "CHR_HSCHR11_1_CTG5"        
##  [67] "CHR_HSCHR11_1_CTG6"         "CHR_HSCHR11_1_CTG7"         "CHR_HSCHR11_1_CTG8"        
##  [70] "CHR_HSCHR11_2_CTG1"         "CHR_HSCHR11_2_CTG1_1"       "CHR_HSCHR11_3_CTG1"        
##  [73] "CHR_HSCHR12_1_CTG1"         "CHR_HSCHR12_1_CTG2_1"       "CHR_HSCHR12_2_CTG1"        
##  [76] "CHR_HSCHR12_2_CTG2"         "CHR_HSCHR12_2_CTG2_1"       "CHR_HSCHR12_3_CTG2"        
##  [79] "CHR_HSCHR12_3_CTG2_1"       "CHR_HSCHR12_4_CTG2"         "CHR_HSCHR12_4_CTG2_1"      
##  [82] "CHR_HSCHR12_5_CTG2"         "CHR_HSCHR12_5_CTG2_1"       "CHR_HSCHR12_6_CTG2_1"      
##  [85] "CHR_HSCHR13_1_CTG1"         "CHR_HSCHR13_1_CTG3"         "CHR_HSCHR13_1_CTG5"        
##  [88] "CHR_HSCHR13_1_CTG8"         "CHR_HSCHR14_1_CTG1"         "CHR_HSCHR14_2_CTG1"        
##  [91] "CHR_HSCHR14_3_CTG1"         "CHR_HSCHR14_7_CTG1"         "CHR_HSCHR15_1_CTG1"        
##  [94] "CHR_HSCHR15_1_CTG3"         "CHR_HSCHR15_1_CTG8"         "CHR_HSCHR15_2_CTG3"        
##  [97] "CHR_HSCHR15_2_CTG8"         "CHR_HSCHR15_3_CTG3"         "CHR_HSCHR15_3_CTG8"        
## [100] "CHR_HSCHR15_4_CTG8"

Let’s subset to just the “standard” chromosomes:

g <- keepStandardChromosomes(g, pruning.mode="coarse")

The following functions tell you the names and lengths of chromosomes, as well as the name of the genome (repeated for each chromosome).

seqlevels(g)
##  [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "20" "21" "22" "3"  "4"  "5"  "6" 
## [20] "7"  "8"  "9"  "MT" "X"  "Y"
seqinfo(g)
## Seqinfo object with 25 sequences (1 circular) from GRCh38 genome:
##   seqnames seqlengths isCircular genome
##   1         248956422      FALSE GRCh38
##   10        133797422      FALSE GRCh38
##   11        135086622      FALSE GRCh38
##   12        133275309      FALSE GRCh38
##   13        114364328      FALSE GRCh38
##   ...             ...        ...    ...
##   8         145138636      FALSE GRCh38
##   9         138394717      FALSE GRCh38
##   MT            16569       TRUE GRCh38
##   X         156040895      FALSE GRCh38
##   Y          57227415      FALSE GRCh38
genome(g)
##        1       10       11       12       13       14       15       16       17       18       19 
## "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" 
##        2       20       21       22        3        4        5        6        7        8        9 
## "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" 
##       MT        X        Y 
## "GRCh38" "GRCh38" "GRCh38"

We can get the number of genes on each chromosome:

table(seqnames(g))
## 
##    1   10   11   12   13   14   15   16   17   18   19    2   20   21   22    3    4    5    6    7 
## 5264 2232 3289 2969 1316 2244 2171 2538 3039 1181 2962 4024 1403  848 1337 3043 2519 2891 2888 2893 
##    8    9   MT    X    Y 
## 2369 2271   37 2399  523

Note that MT (mitochondrial DNA) has many more genes than the other chromosomes, per basepair:

tab <- table(seqnames(g))
tab
## 
##    1   10   11   12   13   14   15   16   17   18   19    2   20   21   22    3    4    5    6    7 
## 5264 2232 3289 2969 1316 2244 2171 2538 3039 1181 2962 4024 1403  848 1337 3043 2519 2891 2888 2893 
##    8    9   MT    X    Y 
## 2369 2271   37 2399  523
lens <- seqlengths(seqinfo(g))
lens
##         1        10        11        12        13        14        15        16        17        18 
## 248956422 133797422 135086622 133275309 114364328 107043718 101991189  90338345  83257441  80373285 
##        19         2        20        21        22         3         4         5         6         7 
##  58617616 242193529  64444167  46709983  50818468 198295559 190214555 181538259 170805979 159345973 
##         8         9        MT         X         Y 
## 145138636 138394717     16569 156040895  57227415
all.equal(names(tab), names(lens))
## [1] TRUE
barplot(tab/lens)

If we remove MT, we can see the range better:

barplot((tab/lens)[-which(names(tab) == "MT")], horiz=TRUE, las=1)
abline(v=(1:5)*1e-5, col=rgb(0,0,0,.5))

So roughly 1 gene per 50,000 basepairs, averaging per chromosome.

mean((lens/tab)[-which(names(tab) == "MT")])
## [1] 55799.91

Finding overlaps

Finally, I want to introduce some functionality we will use a couple times later in the course: finding overlaps of sets of genomic ranges. One scenario is that we have two sets and we want to know which ranges from the two sets overlap each other. We might make a query set and ask which genes are overlapping the set in our query.

query <- GRanges(1, IRanges((101:110)*1e6 + 1,width=1e6))
query
## GRanges object with 10 ranges and 0 metadata columns:
##        seqnames              ranges strand
##           <Rle>           <IRanges>  <Rle>
##    [1]        1 101000001-102000000      *
##    [2]        1 102000001-103000000      *
##    [3]        1 103000001-104000000      *
##    [4]        1 104000001-105000000      *
##    [5]        1 105000001-106000000      *
##    [6]        1 106000001-107000000      *
##    [7]        1 107000001-108000000      *
##    [8]        1 108000001-109000000      *
##    [9]        1 109000001-110000000      *
##   [10]        1 110000001-111000000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can get the numeric indices of the overlaps using findOverlaps. Note that we can turn this resulting object into a data.frame, for further processing.

fo <- findOverlaps(query, g)
head(fo)
## Hits object with 6 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1        2333
##   [2]         1        2335
##   [3]         1        2336
##   [4]         1        2337
##   [5]         1        2338
##   [6]         1        2339
##   -------
##   queryLength: 10 / subjectLength: 58650
tail(fo)
## Hits object with 6 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]        10        2490
##   [2]        10        2491
##   [3]        10        2492
##   [4]        10        2493
##   [5]        10        2494
##   [6]        10        2495
##   -------
##   queryLength: 10 / subjectLength: 58650
# to count the overlaps for each range in query:
table(factor(from(fo), seq_along(query)))
## 
##  1  2  3  4  5  6  7  8  9 10 
## 17  3 15  2  7  8  6 29 44 36

Equivalently, we can use the countOverlaps function for the last command above:

countOverlaps(query, g)
##  [1] 17  3 15  2  7  8  6 29 44 36

We can convert the Hits object to a data.frame:

fo <- as.data.frame(fo)

We can also find out which genes overlap which other genes, by just providing a single set of ranges to findOverlaps. There are many overlapping genes in the human genome, though you should note that we obtain at least two rows in the resulting Hits object for every overlap, and so we can filter this object down by insisting that one index be strictly less than the other.

fo <- findOverlaps(g)
head(fo)
## SelfHits object with 6 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1           1
##   [2]         2           2
##   [3]         2           3
##   [4]         3           2
##   [5]         3           3
##   [6]         4           4
##   -------
##   queryLength: 58650 / subjectLength: 58650
fo <- as.data.frame(fo)
fo <- fo[fo[,1] < fo[,2],]
head(fo)
##    queryHits subjectHits
## 3          2           3
## 12         9          10
## 20        15          16
## 26        19          20
## 30        21          22
## 37        26          28
nrow(fo)
## [1] 12078

Earlier, when we defined query, we didn’t consider strand, and so the strand of the ranges was defined as *. This means that we get overlaps of any genes whether + or -. But when we ask whether the genes overlap each other, by default findOverlaps will not count an overlap if the ranges are on different strand. To ask how many genes overlap each other ignoring strand, we need to specify ignore.strand=TRUE (which produces more overlaps).

fo <- findOverlaps(g, ignore.strand=TRUE)
fo <- as.data.frame(fo)
fo <- fo[fo[,1] < fo[,2],]
nrow(fo)
## [1] 29470

If we just wanted to subset to the genes which overlap a given range, we can use overlapsAny:

g[overlapsAny(g, query[1])]
## GRanges object with 17 ranges and 6 metadata columns:
##                   seqnames              ranges strand |         gene_id     gene_name
##                      <Rle>           <IRanges>  <Rle> |     <character>   <character>
##   ENSG00000117543        1 100989623-101026088      - | ENSG00000117543          DPH5
##   ENSG00000233184        1 101025878-101087268      + | ENSG00000233184 RP11-421L21.3
##   ENSG00000252765        1 101133153-101133339      + | ENSG00000252765      SCARNA16
##   ENSG00000271578        1 101190520-101190994      + | ENSG00000271578  RP5-1033H2.1
##   ENSG00000225938        1 101235683-101236528      - | ENSG00000225938   RP4-575N6.4
##               ...      ...                 ...    ... .             ...           ...
##   ENSG00000183298        1 101786340-101787219      - | ENSG00000183298       RPSAP19
##   ENSG00000118733        1 101802574-101997030      - | ENSG00000118733         OLFM3
##   ENSG00000252717        1 101859851-101859957      + | ENSG00000252717     RNU6-352P
##   ENSG00000271277        1 101882516-101882894      + | ENSG00000271277   RP11-82O2.2
##   ENSG00000162699        1 101893105-101894047      + | ENSG00000162699      DNAJA1P5
##                           gene_biotype seq_coord_system        symbol entrezid
##                            <character>      <character>   <character>   <list>
##   ENSG00000117543       protein_coding       chromosome          DPH5    51611
##   ENSG00000233184            antisense       chromosome RP11-421L21.3     <NA>
##   ENSG00000252765               scaRNA       chromosome      SCARNA16     <NA>
##   ENSG00000271578 processed_pseudogene       chromosome  RP5-1033H2.1     <NA>
##   ENSG00000225938            antisense       chromosome   RP4-575N6.4     <NA>
##               ...                  ...              ...           ...      ...
##   ENSG00000183298 processed_pseudogene       chromosome       RPSAP19     <NA>
##   ENSG00000118733       protein_coding       chromosome         OLFM3   118427
##   ENSG00000252717                snRNA       chromosome     RNU6-352P     <NA>
##   ENSG00000271277 processed_pseudogene       chromosome   RP11-82O2.2     <NA>
##   ENSG00000162699 processed_pseudogene       chromosome      DNAJA1P5     <NA>
##   -------
##   seqinfo: 25 sequences (1 circular) from GRCh38 genome

This is equivalent to the following:

g[g %over% query[1]]
## GRanges object with 17 ranges and 6 metadata columns:
##                   seqnames              ranges strand |         gene_id     gene_name
##                      <Rle>           <IRanges>  <Rle> |     <character>   <character>
##   ENSG00000117543        1 100989623-101026088      - | ENSG00000117543          DPH5
##   ENSG00000233184        1 101025878-101087268      + | ENSG00000233184 RP11-421L21.3
##   ENSG00000252765        1 101133153-101133339      + | ENSG00000252765      SCARNA16
##   ENSG00000271578        1 101190520-101190994      + | ENSG00000271578  RP5-1033H2.1
##   ENSG00000225938        1 101235683-101236528      - | ENSG00000225938   RP4-575N6.4
##               ...      ...                 ...    ... .             ...           ...
##   ENSG00000183298        1 101786340-101787219      - | ENSG00000183298       RPSAP19
##   ENSG00000118733        1 101802574-101997030      - | ENSG00000118733         OLFM3
##   ENSG00000252717        1 101859851-101859957      + | ENSG00000252717     RNU6-352P
##   ENSG00000271277        1 101882516-101882894      + | ENSG00000271277   RP11-82O2.2
##   ENSG00000162699        1 101893105-101894047      + | ENSG00000162699      DNAJA1P5
##                           gene_biotype seq_coord_system        symbol entrezid
##                            <character>      <character>   <character>   <list>
##   ENSG00000117543       protein_coding       chromosome          DPH5    51611
##   ENSG00000233184            antisense       chromosome RP11-421L21.3     <NA>
##   ENSG00000252765               scaRNA       chromosome      SCARNA16     <NA>
##   ENSG00000271578 processed_pseudogene       chromosome  RP5-1033H2.1     <NA>
##   ENSG00000225938            antisense       chromosome   RP4-575N6.4     <NA>
##               ...                  ...              ...           ...      ...
##   ENSG00000183298 processed_pseudogene       chromosome       RPSAP19     <NA>
##   ENSG00000118733       protein_coding       chromosome         OLFM3   118427
##   ENSG00000252717                snRNA       chromosome     RNU6-352P     <NA>
##   ENSG00000271277 processed_pseudogene       chromosome   RP11-82O2.2     <NA>
##   ENSG00000162699 processed_pseudogene       chromosome      DNAJA1P5     <NA>
##   -------
##   seqinfo: 25 sequences (1 circular) from GRCh38 genome

plyranges

As with SummarizedExperiment, there is a “tidy” way to work with GRanges, called plyranges. For more examples of this type of analysis, take a look at some tutorial content I’ve been working on here:

https://nullranges.github.io/tidy-ranges-tutorial

library(plyranges)
g %>% 
  group_by(gene_biotype) %>%
  summarize(width = mean(width))
## DataFrame with 46 rows and 2 columns
##               gene_biotype     width
##                <character> <numeric>
## 1   3prime_overlapping_n..   7001.43
## 2                antisense  20910.71
## 3   bidirectional_promot..  91166.25
## 4                IG_C_gene   2365.29
## 5          IG_C_pseudogene   1012.44
## ...                    ...       ...
## 42  transcribed_unitary_..  50086.47
## 43  transcribed_unproces..  28365.94
## 44      unitary_pseudogene  15708.17
## 45  unprocessed_pseudogene   5417.64
## 46                vaultRNA    127.00

Count overlaps of genes in the query ranges:

query %>%
  mutate(num_genes = count_overlaps(., g))
## GRanges object with 10 ranges and 1 metadata column:
##        seqnames              ranges strand | num_genes
##           <Rle>           <IRanges>  <Rle> | <integer>
##    [1]        1 101000001-102000000      * |        17
##    [2]        1 102000001-103000000      * |         3
##    [3]        1 103000001-104000000      * |        15
##    [4]        1 104000001-105000000      * |         2
##    [5]        1 105000001-106000000      * |         7
##    [6]        1 106000001-107000000      * |         8
##    [7]        1 107000001-108000000      * |         6
##    [8]        1 108000001-109000000      * |        29
##    [9]        1 109000001-110000000      * |        44
##   [10]        1 110000001-111000000      * |        36
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can also do the same types of operations as before, with “base GRanges”:

g %>%
  slice(1:5) %>%
  select(symbol) %>%
  anchor_5p() %>%
  mutate(width=1) # same as resize(..., width=1)
## GRanges object with 5 ranges and 1 metadata column:
##                   seqnames    ranges strand |      symbol
##                      <Rle> <IRanges>  <Rle> | <character>
##   ENSG00000223972        1     11869      + |     DDX11L1
##   ENSG00000227232        1     29570      - |      WASH7P
##   ENSG00000278267        1     17436      - |   MIR6859-1
##   ENSG00000243485        1     29554      + |   MIR1302-2
##   ENSG00000237613        1     36081      - |     FAM138A
##   -------
##   seqinfo: 25 sequences (1 circular) from GRCh38 genome

Plot genes per chromosome:

library(tibble)
library(ggplot2)
g %>% 
  group_by(seqnames) %>%
  summarize(num_genes = n()) %>%
  as_tibble() %>%
  ggplot(aes(seqnames, num_genes)) +
  geom_col()

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.3.6             tibble_3.1.7              plyranges_1.17.0         
##  [4] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.21.2          AnnotationFilter_1.21.0  
##  [7] GenomicFeatures_1.49.5    AnnotationDbi_1.59.1      Biobase_2.57.1           
## [10] rafalib_1.0.0             GenomicRanges_1.49.0      GenomeInfoDb_1.33.3      
## [13] IRanges_2.31.0            S4Vectors_0.35.1          BiocGenerics_0.43.0      
## [16] pkgdown_2.0.5             testthat_3.1.4            rmarkdown_2.14           
## [19] devtools_2.4.3            usethis_2.1.6            
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.29.0         matrixStats_0.62.0          bitops_1.0-7               
##  [4] fs_1.5.2                    bit64_4.0.5                 filelock_1.0.2             
##  [7] RColorBrewer_1.1-3          progress_1.2.2              httr_1.4.3                 
## [10] tools_4.2.1                 bslib_0.3.1                 utf8_1.2.2                 
## [13] R6_2.5.1                    colorspace_2.0-3            lazyeval_0.2.2             
## [16] DBI_1.1.3                   withr_2.5.0                 tidyselect_1.1.2           
## [19] prettyunits_1.1.1           processx_3.6.1              bit_4.0.4                  
## [22] curl_4.3.2                  compiler_4.2.1              cli_3.3.0                  
## [25] xml2_1.3.3                  DelayedArray_0.23.0         labeling_0.4.2             
## [28] rtracklayer_1.57.0          sass_0.4.1                  scales_1.2.0               
## [31] callr_3.7.0                 rappdirs_0.3.3              stringr_1.4.0              
## [34] digest_0.6.29               Rsamtools_2.13.3            XVector_0.37.0             
## [37] pkgconfig_2.0.3             htmltools_0.5.2             sessioninfo_1.2.2          
## [40] MatrixGenerics_1.9.1        dbplyr_2.2.1                fastmap_1.1.0              
## [43] highr_0.9                   rlang_1.0.3                 rstudioapi_0.13            
## [46] RSQLite_2.2.14              farver_2.1.1                jquerylib_0.1.4            
## [49] BiocIO_1.7.1                generics_0.1.3              jsonlite_1.8.0             
## [52] BiocParallel_1.31.10        dplyr_1.0.9                 RCurl_1.98-1.7             
## [55] magrittr_2.0.3              GenomeInfoDbData_1.2.8      Matrix_1.4-1               
## [58] munsell_0.5.0               Rcpp_1.0.8.3                fansi_1.0.3                
## [61] lifecycle_1.0.1             stringi_1.7.6               yaml_2.3.5                 
## [64] SummarizedExperiment_1.27.1 zlibbioc_1.43.0             brio_1.1.3                 
## [67] pkgbuild_1.3.1              BiocFileCache_2.5.0         grid_4.2.1                 
## [70] blob_1.2.3                  parallel_4.2.1              crayon_1.5.1               
## [73] lattice_0.20-45             Biostrings_2.65.1           hms_1.1.1                  
## [76] KEGGREST_1.37.2             knitr_1.39                  ps_1.7.1                   
## [79] pillar_1.7.0                rjson_0.2.21                codetools_0.2-18           
## [82] biomaRt_2.53.2              pkgload_1.3.0               XML_3.99-0.10              
## [85] glue_1.6.2                  evaluate_0.15               remotes_2.4.2              
## [88] png_0.1-7                   vctrs_0.4.1                 gtable_0.3.0               
## [91] purrr_0.3.4                 assertthat_0.2.1            cachem_1.0.6               
## [94] xfun_0.31                   restfulr_0.0.15             GenomicAlignments_1.33.0   
## [97] memoise_2.0.1               ellipsis_0.3.2