Genomic ranges: GRanges

Author

In this lecture note, we demonstrate how to specify regions of the genome (genomic ranges) in R/Bioconductor. We will make use of the GenomicRanges package and the GRanges object defined by this package. The package also provides a number of methods that work on this type of object. Finally we will show some extensions to related packages, and other ways of working with genomic range data.

A GRanges object specifies:

  1. the basepairs involved in the range (these are integer ranges)

  2. the strand (+ or -)

  3. the chromosome

  4. additional metadata (the “score” for the feature, some statistics, etc.)

Chromosomes in Bioconductor are called “sequences” and specified with the column seqnames, because sometimes we may want to specify ranges of molecules that are not chromosomes, for example the mitochondrial genome, viral DNA, un-incorporated contigs, etc. Although the package name is GenomicRanges, we don’t specify that the molecule is necessarily DNA, as it could also in theory be RNA or protein sequence within which we are referring to ranges.

Integer ranges

Let’s start with a set of integer ranges, an IRanges object.

library(GenomicRanges)
ir <- IRanges(start=100 + 1:5, width=10)
ir
IRanges object with 5 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]       101       110        10
  [2]       102       111        10
  [3]       103       112        10
  [4]       104       113        10
  [5]       105       114        10

We could also specify the same ranges by setting the end:

ir <- IRanges(start=100 + 1:5, end=100 + 10:14)
ir
IRanges object with 5 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]       101       110        10
  [2]       102       111        10
  [3]       103       112        10
  [4]       104       113        10
  [5]       105       114        10

We can make a plot of these ranges. It’s a bit different than a typical plot, because I want to point out that we have discrete locations in the genome (so the position isn’t defined over \(\mathbb{R}\) but over \(\mathbb{N}\)). So I’ve shifted the positions of the labels by 0.5 (see the axis command). Another detail to notice is that the ranges are inclusive. So the first range, from 101 to 110 includes the 110 basepair. So I’ve added +1 when I draw the arrows to show it includes the last basepair.

Note that the convention used in Bioconductor to define interval ranges is not universal. The BED convention is that the range does not include the start position of the range. So the integer range written 101 110 in BED format would not include the 101th basepair. These details matter when we want to refer to specific sequences in the genome, e.g. the transcription start site, or TSS, of a gene.

library(rafalib)
par(mar=c(2.5,2.5,.5,.5))
nullplot(start(range(ir)), end(range(ir))+1, 0, 6, xaxt="n")
abline(v=100:115, col="grey")
axis(1, at=100:115 + .5, labels=100:115) 
arrows(start(ir),seq_along(ir),end(ir)+1,seq_along(ir),
       lwd=3,code=3,angle=90,length=.05)

Genomic ranges

So far we’ve just shown integer ranges. To specify a genomic range, we also will come up with the strand of the ranges (+ or -) and the chromosome.

st <- rep(c("+","-"),c(3,2))
st
[1] "+" "+" "+" "-" "-"
gr <- GRanges(seqnames="chr1", ranges=ir, strand=st)
gr
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   101-110      +
  [2]     chr1   102-111      +
  [3]     chr1   103-112      +
  [4]     chr1   104-113      -
  [5]     chr1   105-114      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that GRanges objects, when printed, also have a note about the genome. Here it says that we have an unspecified genome, and there is no information about the seqlengths, the lengths of the chromosomes. This information (basically in what universe do the ranges live) is important to track along with the ranges themselves, in case the user asks to, e.g. move all ranges 1 million basepairs to the right. This is not necessarily possible if we have a range that ends less than 1 million basepairs from the end of a chromosome. Or for example, if you want to randomly select features from the genome, or randomly shuffle the locations of features in the genome, we needto know where the component sequences start and end.

So the GRanges object really needs to also contain details about the genome itself.

BED files and GRanges

As we previewed above, another common format for representing ranges is the plain-text BED format, where the intervals are described in an open-closed format: exclusive of the basepair on the left side, inclusive of the basepair on the right side. As these files are common, we quickly can show the equivalent code for reading/manipulating genomic ranges using bedtools or in Bioconductor:

First, read in some data from BED file format. Here we have some transcript ranges from Ensembl v86 in the hg38 human genome. We also have a set of three query ranges (note the different strand of these).

library(here)
chr1query <- plyranges::read_bed(here("bioc","chr1_query.bed"))
txp <- plyranges::read_bed(here("bioc","chr1_100Mb_v86_txps.bed"))
genome(chr1query) <- "hg38"
genome(txp) <- "hg38"
chr1query
GRanges object with 3 ranges and 2 metadata columns:
      seqnames              ranges strand |        name     score
         <Rle>           <IRanges>  <Rle> | <character> <numeric>
  [1]     chr1 100500001-100600000      * |         foo        99
  [2]     chr1 101200001-101400000      + |         foo        99
  [3]     chr1 101700001-101900000      - |         foo        99
  -------
  seqinfo: 1 sequence from hg38 genome; no seqlengths
txp
GRanges object with 112 ranges and 2 metadata columns:
        seqnames              ranges strand |            name     score
           <Rle>           <IRanges>  <Rle> |     <character> <numeric>
    [1]     chr1  99969789-100026892      + | ENST00000465289         0
    [2]     chr1  99969789-100035637      + | ENST00000533028         0
    [3]     chr1  99969979-100022956      + | ENST00000427993         0
    [4]     chr1  99970436-100026979      + | ENST00000370153         0
    [5]     chr1  99993043-100015341      + | ENST00000422078         0
    ...      ...                 ...    ... .             ...       ...
  [108]     chr1 101803486-101837029      - | ENST00000536598         0
  [109]     chr1 101825090-101910154      - | ENST00000468901         0
  [110]     chr1 101859851-101859957      + | ENST00000516908         0
  [111]     chr1 101882516-101882894      + | ENST00000603614         0
  [112]     chr1 101893105-101894047      + | ENST00000440278         0
  -------
  seqinfo: 1 sequence from hg38 genome; no seqlengths

We can see the number of transcripts that overlap our query ranges. We can ask for this number in different ways, by specifying what to do with the strand information: count only matching strand or missing strand, count only matching strand for the features with strand information, or count but we ignore strand altogether.

The overlapsAny function returns a logical vector along the object that is its first argument (here txp).

txp[ overlapsAny(txp, chr1query) ] # 19
GRanges object with 19 ranges and 2 metadata columns:
       seqnames              ranges strand |            name     score
          <Rle>           <IRanges>  <Rle> |     <character> <numeric>
   [1]     chr1 100352600-100520277      + | ENST00000336454         0
   [2]     chr1 100538137-100542018      + | ENST00000315033         0
   [3]     chr1 100586649-100587431      - | ENST00000455141         0
   [4]     chr1 101236888-101241518      + | ENST00000305352         0
   [5]     chr1 101236927-101239519      + | ENST00000475821         0
   ...      ...                 ...    ... .             ...       ...
  [15]     chr1 101803486-101846973      - | ENST00000338858         0
  [16]     chr1 101803486-101846973      - | ENST00000465523         0
  [17]     chr1 101803486-101996905      - | ENST00000462354         0
  [18]     chr1 101803486-101837029      - | ENST00000536598         0
  [19]     chr1 101825090-101910154      - | ENST00000468901         0
  -------
  seqinfo: 1 sequence from hg38 genome; no seqlengths
txp[ overlapsAny(txp, chr1query[2:3]) ] # 16
GRanges object with 16 ranges and 2 metadata columns:
       seqnames              ranges strand |            name     score
          <Rle>           <IRanges>  <Rle> |     <character> <numeric>
   [1]     chr1 101236888-101241518      + | ENST00000305352         0
   [2]     chr1 101236927-101239519      + | ENST00000475821         0
   [3]     chr1 101238090-101239164      + | ENST00000475289         0
   [4]     chr1 101243158-101243749      + | ENST00000561748         0
   [5]     chr1 101256274-101256616      + | ENST00000455561         0
   ...      ...                 ...    ... .             ...       ...
  [12]     chr1 101803486-101846973      - | ENST00000338858         0
  [13]     chr1 101803486-101846973      - | ENST00000465523         0
  [14]     chr1 101803486-101996905      - | ENST00000462354         0
  [15]     chr1 101803486-101837029      - | ENST00000536598         0
  [16]     chr1 101825090-101910154      - | ENST00000468901         0
  -------
  seqinfo: 1 sequence from hg38 genome; no seqlengths
txp[ overlapsAny(txp, chr1query, ignore.strand=TRUE) ] # 26
GRanges object with 26 ranges and 2 metadata columns:
       seqnames              ranges strand |            name     score
          <Rle>           <IRanges>  <Rle> |     <character> <numeric>
   [1]     chr1 100352600-100520277      + | ENST00000336454         0
   [2]     chr1 100538137-100542018      + | ENST00000315033         0
   [3]     chr1 100586649-100587431      - | ENST00000455141         0
   [4]     chr1 101235683-101236528      - | ENST00000432195         0
   [5]     chr1 101236888-101241518      + | ENST00000305352         0
   ...      ...                 ...    ... .             ...       ...
  [22]     chr1 101803486-101837029      - | ENST00000536598         0
  [23]     chr1 101825090-101910154      - | ENST00000468901         0
  [24]     chr1 101859851-101859957      + | ENST00000516908         0
  [25]     chr1 101882516-101882894      + | ENST00000603614         0
  [26]     chr1 101893105-101894047      + | ENST00000440278         0
  -------
  seqinfo: 1 sequence from hg38 genome; no seqlengths

We can repeat these overlap calls using bedtools intersect (this is a command line program, not run in R):

bedtools intersect -wa -a chr1_100Mb_v86_txps.bed -b chr1_query.bed | head -3
chr1    100352599   100520277   ENST00000336454 0   +
chr1    100538136   100542018   ENST00000315033 0   +
chr1    100586648   100587431   ENST00000455141 0   -

Let’s count first by ignoring strand. This corresponds to the number above when we set ignore.strand=TRUE.

bedtools intersect -wa -a chr1_100Mb_v86_txps.bed -b chr1_query.bed | wc -l
      26

In bedtools you use an -s to specify you only want to return overlaps with matching strand (and we drop the unstranded features).

bedtools intersect -s -wa -a chr1_100Mb_v86_txps.bed -b chr1_query.bed | wc -l
      16

GRanges range manipulation

Here we show some more basic manipulations of GRanges:

gr + 10 # growing by 10 on either side
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    91-120      +
  [2]     chr1    92-121      +
  [3]     chr1    93-122      +
  [4]     chr1    94-123      -
  [5]     chr1    95-124      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
shift(gr, 10) # shifting by 10
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   111-120      +
  [2]     chr1   112-121      +
  [3]     chr1   113-122      +
  [4]     chr1   114-123      -
  [5]     chr1   115-124      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
resize(gr, width=1) # 1 bp from the start (+) or end (-)
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       101      +
  [2]     chr1       102      +
  [3]     chr1       103      +
  [4]     chr1       113      -
  [5]     chr1       114      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
flank(gr, width=2, both=TRUE) # +/- 2 bp from " or "
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    99-102      +
  [2]     chr1   100-103      +
  [3]     chr1   101-104      +
  [4]     chr1   112-115      -
  [5]     chr1   113-116      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that the last two commands, resize and flank, gave back ranges relative to the start of the ranges with positive strand, and to the end of the ranges with negative strand. This is particular useful for working with genes, in which we often care about the region surrounding the transcription start site or TSS of a gene. The TSS will be the left-most basepair of a gene on the positive strand, and the right-most basepair of a gene on the negative strand.

These functions would usually give us more information, but in this case the information was not specified:

seqnames(gr)
factor-Rle of length 5 with 1 run
  Lengths:    5
  Values : chr1
Levels(1): chr1
seqinfo(gr)
Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chr1             NA         NA   <NA>

Instead of looking at a toy GRanges, let’s download some actual genomic ranges, the set of human genes, as annotated by the Ensembl project, version 86.

This data is easily accessible in Bioconductor using the ensembldb package, and the specific Ensembl database package from Bioconductor.

This says there are about 60 thousand genes (according to Ensembl).

library(ensembldb)
library(EnsDb.Hsapiens.v86) # this pkg is about 75 Mb
edb <- EnsDb.Hsapiens.v86
g <- genes(edb)
length(g)
[1] 63970

If you try to print g it will just show you the beginning and end:

g
GRanges object with 63970 ranges and 6 metadata columns:
                  seqnames            ranges strand |         gene_id   gene_name
                     <Rle>         <IRanges>  <Rle> |     <character> <character>
  ENSG00000223972        1       11869-14409      + | ENSG00000223972     DDX11L1
  ENSG00000227232        1       14404-29570      - | ENSG00000227232      WASH7P
  ENSG00000278267        1       17369-17436      - | ENSG00000278267   MIR6859-1
  ENSG00000243485        1       29554-31109      + | ENSG00000243485   MIR1302-2
  ENSG00000237613        1       34554-36081      - | ENSG00000237613     FAM138A
              ...      ...               ...    ... .             ...         ...
  ENSG00000224240        Y 26549425-26549743      + | ENSG00000224240     CYCSP49
  ENSG00000227629        Y 26586642-26591601      - | ENSG00000227629  SLC25A15P1
  ENSG00000237917        Y 26594851-26634652      - | ENSG00000237917     PARP4P1
  ENSG00000231514        Y 26626520-26627159      - | ENSG00000231514     FAM58CP
  ENSG00000235857        Y 56855244-56855488      + | ENSG00000235857     CTBP2P1
                            gene_biotype seq_coord_system      symbol
                             <character>      <character> <character>
  ENSG00000223972 transcribed_unproces..       chromosome     DDX11L1
  ENSG00000227232 unprocessed_pseudogene       chromosome      WASH7P
  ENSG00000278267                  miRNA       chromosome   MIR6859-1
  ENSG00000243485                lincRNA       chromosome   MIR1302-2
  ENSG00000237613                lincRNA       chromosome     FAM138A
              ...                    ...              ...         ...
  ENSG00000224240   processed_pseudogene       chromosome     CYCSP49
  ENSG00000227629 unprocessed_pseudogene       chromosome  SLC25A15P1
  ENSG00000237917 unprocessed_pseudogene       chromosome     PARP4P1
  ENSG00000231514   processed_pseudogene       chromosome     FAM58CP
  ENSG00000235857   processed_pseudogene       chromosome     CTBP2P1
                                        entrezid
                                          <list>
  ENSG00000223972 100287596,100287102,727856,...
  ENSG00000227232                           <NA>
  ENSG00000278267                      102466751
  ENSG00000243485            105376912,100302278
  ENSG00000237613           654835,645520,641702
              ...                            ...
  ENSG00000224240                           <NA>
  ENSG00000227629                           <NA>
  ENSG00000237917                           <NA>
  ENSG00000231514                           <NA>
  ENSG00000235857                           <NA>
  -------
  seqinfo: 357 sequences (1 circular) from GRCh38 genome

You can take a look at a single gene:

g[1]
GRanges object with 1 range and 6 metadata columns:
                  seqnames      ranges strand |         gene_id   gene_name           gene_biotype
                     <Rle>   <IRanges>  <Rle> |     <character> <character>            <character>
  ENSG00000223972        1 11869-14409      + | ENSG00000223972     DDX11L1 transcribed_unproces..
                  seq_coord_system      symbol                       entrezid
                       <character> <character>                         <list>
  ENSG00000223972       chromosome     DDX11L1 100287596,100287102,727856,...
  -------
  seqinfo: 357 sequences (1 circular) from GRCh38 genome

To pull out the gene names you use a $. This would give us a 60 thousand length character vector, so we just show the first 5.

g$gene_name[1:5]
[1] "DDX11L1"   "WASH7P"    "MIR6859-1" "MIR1302-2" "FAM138A"  

We can look at the distribution of genes by chromosome. The seqnames returns a factor-Rle where Rle stands for run length encoding. This is a way to compress a long vector which may contain repeated values in runs.

seqnames(g)
factor-Rle of length 63970 with 357 runs
  Lengths:                       5264                       2232 ...                        523
  Values : 1                          10                         ...         Y                 
Levels(357): 1 10 11 12 13 14 15 16 17 18 ... LRG_187 LRG_239 LRG_311 LRG_721 LRG_741 LRG_93 MT X Y

The object g has information about genes on the standard chromosomes, as well as genes or haplotypes of genes on haplotype chromosomes.

head(seqlevels(g),100)
  [1] "1"                          "10"                         "11"                        
  [4] "12"                         "13"                         "14"                        
  [7] "15"                         "16"                         "17"                        
 [10] "18"                         "19"                         "2"                         
 [13] "20"                         "21"                         "22"                        
 [16] "3"                          "4"                          "5"                         
 [19] "6"                          "7"                          "8"                         
 [22] "9"                          "CHR_HG107_PATCH"            "CHR_HG126_PATCH"           
 [25] "CHR_HG1311_PATCH"           "CHR_HG1342_HG2282_PATCH"    "CHR_HG1362_PATCH"          
 [28] "CHR_HG142_HG150_NOVEL_TEST" "CHR_HG151_NOVEL_TEST"       "CHR_HG1651_PATCH"          
 [31] "CHR_HG1832_PATCH"           "CHR_HG2021_PATCH"           "CHR_HG2022_PATCH"          
 [34] "CHR_HG2023_PATCH"           "CHR_HG2030_PATCH"           "CHR_HG2058_PATCH"          
 [37] "CHR_HG2062_PATCH"           "CHR_HG2063_PATCH"           "CHR_HG2066_PATCH"          
 [40] "CHR_HG2072_PATCH"           "CHR_HG2095_PATCH"           "CHR_HG2104_PATCH"          
 [43] "CHR_HG2116_PATCH"           "CHR_HG2128_PATCH"           "CHR_HG2191_PATCH"          
 [46] "CHR_HG2213_PATCH"           "CHR_HG2217_PATCH"           "CHR_HG2232_PATCH"          
 [49] "CHR_HG2233_PATCH"           "CHR_HG2235_PATCH"           "CHR_HG2239_PATCH"          
 [52] "CHR_HG2247_PATCH"           "CHR_HG2249_PATCH"           "CHR_HG2288_HG2289_PATCH"   
 [55] "CHR_HG2290_PATCH"           "CHR_HG2291_PATCH"           "CHR_HG2334_PATCH"          
 [58] "CHR_HG26_PATCH"             "CHR_HG986_PATCH"            "CHR_HSCHR10_1_CTG1"        
 [61] "CHR_HSCHR10_1_CTG2"         "CHR_HSCHR10_1_CTG3"         "CHR_HSCHR10_1_CTG4"        
 [64] "CHR_HSCHR10_1_CTG6"         "CHR_HSCHR11_1_CTG1_2"       "CHR_HSCHR11_1_CTG5"        
 [67] "CHR_HSCHR11_1_CTG6"         "CHR_HSCHR11_1_CTG7"         "CHR_HSCHR11_1_CTG8"        
 [70] "CHR_HSCHR11_2_CTG1"         "CHR_HSCHR11_2_CTG1_1"       "CHR_HSCHR11_3_CTG1"        
 [73] "CHR_HSCHR12_1_CTG1"         "CHR_HSCHR12_1_CTG2_1"       "CHR_HSCHR12_2_CTG1"        
 [76] "CHR_HSCHR12_2_CTG2"         "CHR_HSCHR12_2_CTG2_1"       "CHR_HSCHR12_3_CTG2"        
 [79] "CHR_HSCHR12_3_CTG2_1"       "CHR_HSCHR12_4_CTG2"         "CHR_HSCHR12_4_CTG2_1"      
 [82] "CHR_HSCHR12_5_CTG2"         "CHR_HSCHR12_5_CTG2_1"       "CHR_HSCHR12_6_CTG2_1"      
 [85] "CHR_HSCHR13_1_CTG1"         "CHR_HSCHR13_1_CTG3"         "CHR_HSCHR13_1_CTG5"        
 [88] "CHR_HSCHR13_1_CTG8"         "CHR_HSCHR14_1_CTG1"         "CHR_HSCHR14_2_CTG1"        
 [91] "CHR_HSCHR14_3_CTG1"         "CHR_HSCHR14_7_CTG1"         "CHR_HSCHR15_1_CTG1"        
 [94] "CHR_HSCHR15_1_CTG3"         "CHR_HSCHR15_1_CTG8"         "CHR_HSCHR15_2_CTG3"        
 [97] "CHR_HSCHR15_2_CTG8"         "CHR_HSCHR15_3_CTG3"         "CHR_HSCHR15_3_CTG8"        
[100] "CHR_HSCHR15_4_CTG8"        

Let’s subset to just the “standard” chromosomes:

g <- keepStandardChromosomes(g, pruning.mode="coarse")

The following functions tell you the names and lengths of chromosomes, as well as the name of the genome (repeated for each chromosome).

seqlevels(g)
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "20" "21" "22" "3"  "4"  "5"  "6" 
[20] "7"  "8"  "9"  "MT" "X"  "Y" 
seqinfo(g)
Seqinfo object with 25 sequences (1 circular) from GRCh38 genome:
  seqnames seqlengths isCircular genome
  1         248956422      FALSE GRCh38
  10        133797422      FALSE GRCh38
  11        135086622      FALSE GRCh38
  12        133275309      FALSE GRCh38
  13        114364328      FALSE GRCh38
  ...             ...        ...    ...
  8         145138636      FALSE GRCh38
  9         138394717      FALSE GRCh38
  MT            16569       TRUE GRCh38
  X         156040895      FALSE GRCh38
  Y          57227415      FALSE GRCh38
genome(g)
       1       10       11       12       13       14       15       16       17       18       19 
"GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" 
       2       20       21       22        3        4        5        6        7        8        9 
"GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" 
      MT        X        Y 
"GRCh38" "GRCh38" "GRCh38" 

We can get the number of genes on each chromosome:

table(seqnames(g))

   1   10   11   12   13   14   15   16   17   18   19    2   20   21   22    3    4    5    6    7 
5264 2232 3289 2969 1316 2244 2171 2538 3039 1181 2962 4024 1403  848 1337 3043 2519 2891 2888 2893 
   8    9   MT    X    Y 
2369 2271   37 2399  523 

Note that MT (mitochondrial DNA) has many more genes than the other chromosomes, per basepair:

tab <- table(seqnames(g))
tab

   1   10   11   12   13   14   15   16   17   18   19    2   20   21   22    3    4    5    6    7 
5264 2232 3289 2969 1316 2244 2171 2538 3039 1181 2962 4024 1403  848 1337 3043 2519 2891 2888 2893 
   8    9   MT    X    Y 
2369 2271   37 2399  523 
lens <- seqlengths(seqinfo(g))
lens
        1        10        11        12        13        14        15        16        17        18 
248956422 133797422 135086622 133275309 114364328 107043718 101991189  90338345  83257441  80373285 
       19         2        20        21        22         3         4         5         6         7 
 58617616 242193529  64444167  46709983  50818468 198295559 190214555 181538259 170805979 159345973 
        8         9        MT         X         Y 
145138636 138394717     16569 156040895  57227415 
all.equal(names(tab), names(lens))
[1] TRUE
barplot(tab/lens)

If we remove MT, we can see the range better:

barplot((tab/lens)[-which(names(tab) == "MT")], horiz=TRUE, las=1)
abline(v=(1:5)*1e-5, col=rgb(0,0,0,.5))

So roughly 1 gene per 50,000 basepairs, averaging per chromosome.

mean((lens/tab)[-which(names(tab) == "MT")])
[1] 55799.91

Finding overlaps

Now we return to the overlaps functionality that we briefly covered above. There are many ways to find overlaps. I first introduce the base Bioconductor functions for computing overlaps, although later I will show a way to compute overlaps using join-style operations, using a package called plyranges.

One scenario is that we have two sets of genomic ranges and we want to know which ranges from the two sets overlap each other. We might make a query set and ask which genes are overlapping the set in our query.

query <- GRanges(1, IRanges((101:110)*1e6 + 1,width=1e6))
query
GRanges object with 10 ranges and 0 metadata columns:
       seqnames              ranges strand
          <Rle>           <IRanges>  <Rle>
   [1]        1 101000001-102000000      *
   [2]        1 102000001-103000000      *
   [3]        1 103000001-104000000      *
   [4]        1 104000001-105000000      *
   [5]        1 105000001-106000000      *
   [6]        1 106000001-107000000      *
   [7]        1 107000001-108000000      *
   [8]        1 108000001-109000000      *
   [9]        1 109000001-110000000      *
  [10]        1 110000001-111000000      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can get the numeric indices of the overlaps using findOverlaps. Note that we can turn this resulting object into a data.frame, for further processing.

fo <- findOverlaps(query, g)
head(fo)
Hits object with 6 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1        2333
  [2]         1        2335
  [3]         1        2336
  [4]         1        2337
  [5]         1        2338
  [6]         1        2339
  -------
  queryLength: 10 / subjectLength: 58650
tail(fo)
Hits object with 6 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]        10        2490
  [2]        10        2491
  [3]        10        2492
  [4]        10        2493
  [5]        10        2494
  [6]        10        2495
  -------
  queryLength: 10 / subjectLength: 58650
# to count the overlaps for each range in query:
table(factor(from(fo), seq_along(query)))

 1  2  3  4  5  6  7  8  9 10 
17  3 15  2  7  8  6 29 44 36 

Equivalently, we can use the countOverlaps function for the last command above:

countOverlaps(query, g)
 [1] 17  3 15  2  7  8  6 29 44 36

We can convert the Hits object to a data.frame:

fo <- as.data.frame(fo)

We can also find out which genes overlap which other genes, by just providing a single set of ranges to findOverlaps. There are many overlapping genes in the human genome, though you should note that we obtain at least two rows in the resulting Hits object for every overlap, and so we can filter this object down by insisting that one index be strictly less than the other.

fo <- findOverlaps(g)
head(fo)
SelfHits object with 6 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           2
  [3]         2           3
  [4]         3           2
  [5]         3           3
  [6]         4           4
  -------
  queryLength: 58650 / subjectLength: 58650
fo <- as.data.frame(fo)
fo <- fo[fo[,1] < fo[,2],]
head(fo)
   queryHits subjectHits
3          2           3
12         9          10
20        15          16
26        19          20
30        21          22
37        26          28
nrow(fo)
[1] 12078

Earlier, when we defined query, we didn’t consider strand, and so the strand of the ranges was defined as *. This means that we get overlaps of any genes whether + or -. But when we ask whether the genes overlap each other, by default findOverlaps will not count an overlap if the ranges are on different strand. To ask how many genes overlap each other ignoring strand, we need to specify ignore.strand=TRUE (which produces more overlaps).

fo <- findOverlaps(g, ignore.strand=TRUE)
fo <- as.data.frame(fo)
fo <- fo[fo[,1] < fo[,2],]
nrow(fo)
[1] 29470

If we just wanted to subset to the genes which overlap a given range, we can use overlapsAny, as we saw above:

g[ overlapsAny(g, query[1]) ]
GRanges object with 17 ranges and 6 metadata columns:
                  seqnames              ranges strand |         gene_id     gene_name
                     <Rle>           <IRanges>  <Rle> |     <character>   <character>
  ENSG00000117543        1 100989623-101026088      - | ENSG00000117543          DPH5
  ENSG00000233184        1 101025878-101087268      + | ENSG00000233184 RP11-421L21.3
  ENSG00000252765        1 101133153-101133339      + | ENSG00000252765      SCARNA16
  ENSG00000271578        1 101190520-101190994      + | ENSG00000271578  RP5-1033H2.1
  ENSG00000225938        1 101235683-101236528      - | ENSG00000225938   RP4-575N6.4
              ...      ...                 ...    ... .             ...           ...
  ENSG00000183298        1 101786340-101787219      - | ENSG00000183298       RPSAP19
  ENSG00000118733        1 101802574-101997030      - | ENSG00000118733         OLFM3
  ENSG00000252717        1 101859851-101859957      + | ENSG00000252717     RNU6-352P
  ENSG00000271277        1 101882516-101882894      + | ENSG00000271277   RP11-82O2.2
  ENSG00000162699        1 101893105-101894047      + | ENSG00000162699      DNAJA1P5
                          gene_biotype seq_coord_system        symbol entrezid
                           <character>      <character>   <character>   <list>
  ENSG00000117543       protein_coding       chromosome          DPH5    51611
  ENSG00000233184            antisense       chromosome RP11-421L21.3     <NA>
  ENSG00000252765               scaRNA       chromosome      SCARNA16     <NA>
  ENSG00000271578 processed_pseudogene       chromosome  RP5-1033H2.1     <NA>
  ENSG00000225938            antisense       chromosome   RP4-575N6.4     <NA>
              ...                  ...              ...           ...      ...
  ENSG00000183298 processed_pseudogene       chromosome       RPSAP19     <NA>
  ENSG00000118733       protein_coding       chromosome         OLFM3   118427
  ENSG00000252717                snRNA       chromosome     RNU6-352P     <NA>
  ENSG00000271277 processed_pseudogene       chromosome   RP11-82O2.2     <NA>
  ENSG00000162699 processed_pseudogene       chromosome      DNAJA1P5     <NA>
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38 genome

This is equivalent to the following (while the binary operator however doesn’t allow for arguments):

g[ g %over% query[1] ]
GRanges object with 17 ranges and 6 metadata columns:
                  seqnames              ranges strand |         gene_id     gene_name
                     <Rle>           <IRanges>  <Rle> |     <character>   <character>
  ENSG00000117543        1 100989623-101026088      - | ENSG00000117543          DPH5
  ENSG00000233184        1 101025878-101087268      + | ENSG00000233184 RP11-421L21.3
  ENSG00000252765        1 101133153-101133339      + | ENSG00000252765      SCARNA16
  ENSG00000271578        1 101190520-101190994      + | ENSG00000271578  RP5-1033H2.1
  ENSG00000225938        1 101235683-101236528      - | ENSG00000225938   RP4-575N6.4
              ...      ...                 ...    ... .             ...           ...
  ENSG00000183298        1 101786340-101787219      - | ENSG00000183298       RPSAP19
  ENSG00000118733        1 101802574-101997030      - | ENSG00000118733         OLFM3
  ENSG00000252717        1 101859851-101859957      + | ENSG00000252717     RNU6-352P
  ENSG00000271277        1 101882516-101882894      + | ENSG00000271277   RP11-82O2.2
  ENSG00000162699        1 101893105-101894047      + | ENSG00000162699      DNAJA1P5
                          gene_biotype seq_coord_system        symbol entrezid
                           <character>      <character>   <character>   <list>
  ENSG00000117543       protein_coding       chromosome          DPH5    51611
  ENSG00000233184            antisense       chromosome RP11-421L21.3     <NA>
  ENSG00000252765               scaRNA       chromosome      SCARNA16     <NA>
  ENSG00000271578 processed_pseudogene       chromosome  RP5-1033H2.1     <NA>
  ENSG00000225938            antisense       chromosome   RP4-575N6.4     <NA>
              ...                  ...              ...           ...      ...
  ENSG00000183298 processed_pseudogene       chromosome       RPSAP19     <NA>
  ENSG00000118733       protein_coding       chromosome         OLFM3   118427
  ENSG00000252717                snRNA       chromosome     RNU6-352P     <NA>
  ENSG00000271277 processed_pseudogene       chromosome   RP11-82O2.2     <NA>
  ENSG00000162699 processed_pseudogene       chromosome      DNAJA1P5     <NA>
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38 genome

plyranges

As with SummarizedExperiment, there is a “tidy” way to work with GRanges, called plyranges. For more examples of this type of analysis, take a look at some tutorial content I’ve been working on here:

https://tidyomics.github.io/tidy-ranges-tutorial

library(plyranges)
g %>% 
  group_by(gene_biotype) %>%
  summarize(width = mean(width))
DataFrame with 46 rows and 2 columns
              gene_biotype     width
               <character> <numeric>
1   3prime_overlapping_n..   7001.43
2                antisense  20910.71
3   bidirectional_promot..  91166.25
4                IG_C_gene   2365.29
5          IG_C_pseudogene   1012.44
...                    ...       ...
42  transcribed_unitary_..  50086.47
43  transcribed_unproces..  28365.94
44      unitary_pseudogene  15708.17
45  unprocessed_pseudogene   5417.64
46                vaultRNA    127.00

Count overlaps of genes in the query ranges:

query %>%
  mutate(num_genes = count_overlaps(., g))
GRanges object with 10 ranges and 1 metadata column:
       seqnames              ranges strand | num_genes
          <Rle>           <IRanges>  <Rle> | <integer>
   [1]        1 101000001-102000000      * |        17
   [2]        1 102000001-103000000      * |         3
   [3]        1 103000001-104000000      * |        15
   [4]        1 104000001-105000000      * |         2
   [5]        1 105000001-106000000      * |         7
   [6]        1 106000001-107000000      * |         8
   [7]        1 107000001-108000000      * |         6
   [8]        1 108000001-109000000      * |        29
   [9]        1 109000001-110000000      * |        44
  [10]        1 110000001-111000000      * |        36
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can also do the same types of operations as before, with “base GRanges”:

g %>%
  slice(1:5) %>%
  select(symbol) %>%
  anchor_5p() %>%
  mutate(width=1) # same as resize(..., width=1)
GRanges object with 5 ranges and 1 metadata column:
                  seqnames    ranges strand |      symbol
                     <Rle> <IRanges>  <Rle> | <character>
  ENSG00000223972        1     11869      + |     DDX11L1
  ENSG00000227232        1     29570      - |      WASH7P
  ENSG00000278267        1     17436      - |   MIR6859-1
  ENSG00000243485        1     29554      + |   MIR1302-2
  ENSG00000237613        1     36081      - |     FAM138A
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38 genome

Plot genes per chromosome:

library(tibble)
library(ggplot2)
g %>% 
  group_by(seqnames) %>%
  summarize(num_genes = n()) %>%
  as_tibble() %>%
  ggplot(aes(seqnames, num_genes)) +
  geom_col()

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ggplot2_3.5.1             tibble_3.2.1              plyranges_1.24.0         
 [4] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.28.0          AnnotationFilter_1.28.0  
 [7] GenomicFeatures_1.56.0    AnnotationDbi_1.66.0      Biobase_2.64.0           
[10] here_1.0.1                rafalib_1.0.0             GenomicRanges_1.56.1     
[13] GenomeInfoDb_1.40.1       IRanges_2.38.1            S4Vectors_0.42.1         
[16] BiocGenerics_0.50.0       testthat_3.2.1.1          rmarkdown_2.27           
[19] devtools_2.4.5            usethis_3.0.0            

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                   bitops_1.0-8                remotes_2.5.0              
 [4] rlang_1.1.4                 magrittr_2.0.3              matrixStats_1.3.0          
 [7] compiler_4.4.1              RSQLite_2.3.7               png_0.1-8                  
[10] vctrs_0.6.5                 ProtGenerics_1.36.0         stringr_1.5.1              
[13] profvis_0.3.8               pkgconfig_2.0.3             crayon_1.5.3               
[16] fastmap_1.2.0               XVector_0.44.0              ellipsis_0.3.2             
[19] labeling_0.4.3              utf8_1.2.4                  Rsamtools_2.20.0           
[22] promises_1.3.0              sessioninfo_1.2.2           UCSC.utils_1.0.0           
[25] purrr_1.0.2                 bit_4.0.5                   xfun_0.46                  
[28] zlibbioc_1.50.0             cachem_1.1.0                jsonlite_1.8.8             
[31] blob_1.2.4                  later_1.3.2                 DelayedArray_0.30.1        
[34] BiocParallel_1.38.0         parallel_4.4.1              R6_2.5.1                   
[37] stringi_1.8.4               RColorBrewer_1.1-3          rtracklayer_1.64.0         
[40] pkgload_1.4.0               brio_1.1.5                  Rcpp_1.0.13                
[43] SummarizedExperiment_1.34.0 knitr_1.48                  httpuv_1.6.15              
[46] Matrix_1.7-0                tidyselect_1.2.1            rstudioapi_0.16.0          
[49] abind_1.4-5                 yaml_2.3.10                 codetools_0.2-20           
[52] miniUI_0.1.1.1              curl_5.2.1                  pkgbuild_1.4.4             
[55] lattice_0.22-6              withr_3.0.1                 shiny_1.9.1                
[58] KEGGREST_1.44.1             evaluate_0.24.0             urlchecker_1.0.1           
[61] Biostrings_2.72.1           pillar_1.9.0                MatrixGenerics_1.16.0      
[64] generics_0.1.3              rprojroot_2.0.4             RCurl_1.98-1.16            
[67] munsell_0.5.1               scales_1.3.0                xtable_1.8-4               
[70] glue_1.7.0                  lazyeval_0.2.2              tools_4.4.1                
[73] BiocIO_1.14.0               GenomicAlignments_1.40.0    fs_1.6.4                   
[76] XML_3.99-0.17               grid_4.4.1                  colorspace_2.1-1           
[79] GenomeInfoDbData_1.2.12     restfulr_0.0.15             cli_3.6.3                  
[82] fansi_1.0.6                 S4Arrays_1.4.1              dplyr_1.1.4                
[85] gtable_0.3.5                digest_0.6.36               SparseArray_1.4.8          
[88] farver_2.1.2                rjson_0.2.21                htmlwidgets_1.6.4          
[91] memoise_2.0.1               htmltools_0.5.8.1           lifecycle_1.0.4            
[94] httr_1.4.7                  mime_0.12                   bit64_4.0.5