In this lecture note, we demonstrate how to specify regions of the genome (genomic ranges) in R/Bioconductor. We will make use of the GenomicRanges package and the GRanges object defined by this package. The package also provides a number of methods that work on this type of object. Finally we will show some extensions to related packages, and other ways of working with genomic range data.
A GRanges object specifies:
the basepairs involved in the range (these are integer ranges)
the strand (+ or -)
the chromosome
additional metadata (the “score” for the feature, some statistics, etc.)
Chromosomes in Bioconductor are called “sequences” and specified with the column seqnames, because sometimes we may want to specify ranges of molecules that are not chromosomes, for example the mitochondrial genome, viral DNA, un-incorporated contigs, etc. Although the package name is GenomicRanges, we don’t specify that the molecule is necessarily DNA, as it could also in theory be RNA or protein sequence within which we are referring to ranges.
Integer ranges
Let’s start with a set of integer ranges, an IRanges object.
We can make a plot of these ranges. It’s a bit different than a typical plot, because I want to point out that we have discrete locations in the genome (so the position isn’t defined over \(\mathbb{R}\) but over \(\mathbb{N}\)). So I’ve shifted the positions of the labels by 0.5 (see the axis command). Another detail to notice is that the ranges are inclusive. So the first range, from 101 to 110 includes the 110 basepair. So I’ve added +1 when I draw the arrows to show it includes the last basepair.
Note that the convention used in Bioconductor to define interval ranges is not universal. The BED convention is that the range does not include the start position of the range. So the integer range written 101 110 in BED format would not include the 101th basepair. These details matter when we want to refer to specific sequences in the genome, e.g. the transcription start site, or TSS, of a gene.
So far we’ve just shown integer ranges. To specify a genomic range, we also will come up with the strand of the ranges (+ or -) and the chromosome.
st <-rep(c("+","-"),c(3,2))st
[1] "+" "+" "+" "-" "-"
gr <-GRanges(seqnames="chr1", ranges=ir, strand=st)gr
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 101-110 +
[2] chr1 102-111 +
[3] chr1 103-112 +
[4] chr1 104-113 -
[5] chr1 105-114 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note that GRanges objects, when printed, also have a note about the genome. Here it says that we have an unspecified genome, and there is no information about the seqlengths, the lengths of the chromosomes. This information (basically in what universe do the ranges live) is important to track along with the ranges themselves, in case the user asks to, e.g. move all ranges 1 million basepairs to the right. This is not necessarily possible if we have a range that ends less than 1 million basepairs from the end of a chromosome. Or for example, if you want to randomly select features from the genome, or randomly shuffle the locations of features in the genome, we needto know where the component sequences start and end.
So the GRanges object really needs to also contain details about the genome itself.
BED files and GRanges
As we previewed above, another common format for representing ranges is the plain-text BED format, where the intervals are described in an open-closed format: exclusive of the basepair on the left side, inclusive of the basepair on the right side. As these files are common, we quickly can show the equivalent code for reading/manipulating genomic ranges using bedtools or in Bioconductor:
First, read in some data from BED file format. Here we have some transcript ranges from Ensembl v86 in the hg38 human genome. We also have a set of three query ranges (note the different strand of these).
We can see the number of transcripts that overlap our query ranges. We can ask for this number in different ways, by specifying what to do with the strand information: count only matching strand or missing strand, count only matching strand for the features with strand information, or count but we ignore strand altogether.
The overlapsAny function returns a logical vector along the object that is its first argument (here txp).
Here we show some more basic manipulations of GRanges:
gr +10# growing by 10 on either side
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 91-120 +
[2] chr1 92-121 +
[3] chr1 93-122 +
[4] chr1 94-123 -
[5] chr1 95-124 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
shift(gr, 10) # shifting by 10
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 111-120 +
[2] chr1 112-121 +
[3] chr1 113-122 +
[4] chr1 114-123 -
[5] chr1 115-124 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
resize(gr, width=1) # 1 bp from the start (+) or end (-)
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 101 +
[2] chr1 102 +
[3] chr1 103 +
[4] chr1 113 -
[5] chr1 114 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
flank(gr, width=2, both=TRUE) # +/- 2 bp from " or "
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 99-102 +
[2] chr1 100-103 +
[3] chr1 101-104 +
[4] chr1 112-115 -
[5] chr1 113-116 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note that the last two commands, resize and flank, gave back ranges relative to the start of the ranges with positive strand, and to the end of the ranges with negative strand. This is particular useful for working with genes, in which we often care about the region surrounding the transcription start site or TSS of a gene. The TSS will be the left-most basepair of a gene on the positive strand, and the right-most basepair of a gene on the negative strand.
These functions would usually give us more information, but in this case the information was not specified:
seqnames(gr)
factor-Rle of length 5 with 1 run
Lengths: 5
Values : chr1
Levels(1): chr1
seqinfo(gr)
Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
seqnames seqlengths isCircular genome
chr1 NA NA <NA>
Instead of looking at a toy GRanges, let’s download some actual genomic ranges, the set of human genes, as annotated by the Ensembl project, version 86.
This data is easily accessible in Bioconductor using the ensembldb package, and the specific Ensembl database package from Bioconductor.
This says there are about 60 thousand genes (according to Ensembl).
library(ensembldb)library(EnsDb.Hsapiens.v86) # this pkg is about 75 Mbedb <- EnsDb.Hsapiens.v86g <-genes(edb)length(g)
[1] 63970
If you try to print g it will just show you the beginning and end:
We can look at the distribution of genes by chromosome. The seqnames returns a factor-Rle where Rle stands for run length encoding. This is a way to compress a long vector which may contain repeated values in runs.
seqnames(g)
factor-Rle of length 63970 with 357 runs
Lengths: 5264 2232 ... 523
Values : 1 10 ... Y
Levels(357): 1 10 11 12 13 14 15 16 17 18 ... LRG_187 LRG_239 LRG_311 LRG_721 LRG_741 LRG_93 MT X Y
The object g has information about genes on the standard chromosomes, as well as genes or haplotypes of genes on haplotype chromosomes.
So roughly 1 gene per 50,000 basepairs, averaging per chromosome.
mean((lens/tab)[-which(names(tab) =="MT")])
[1] 55799.91
Finding overlaps
Now we return to the overlaps functionality that we briefly covered above. There are many ways to find overlaps. I first introduce the base Bioconductor functions for computing overlaps, although later I will show a way to compute overlaps using join-style operations, using a package called plyranges.
One scenario is that we have two sets of genomic ranges and we want to know which ranges from the two sets overlap each other. We might make a query set and ask which genes are overlapping the set in our query.
We can get the numeric indices of the overlaps using findOverlaps. Note that we can turn this resulting object into a data.frame, for further processing.
# to count the overlaps for each range in query:table(factor(from(fo), seq_along(query)))
1 2 3 4 5 6 7 8 9 10
17 3 15 2 7 8 6 29 44 36
Equivalently, we can use the countOverlaps function for the last command above:
countOverlaps(query, g)
[1] 17 3 15 2 7 8 6 29 44 36
We can convert the Hits object to a data.frame:
fo <-as.data.frame(fo)
We can also find out which genes overlap which other genes, by just providing a single set of ranges to findOverlaps. There are many overlapping genes in the human genome, though you should note that we obtain at least two rows in the resulting Hits object for every overlap, and so we can filter this object down by insisting that one index be strictly less than the other.
Earlier, when we defined query, we didn’t consider strand, and so the strand of the ranges was defined as *. This means that we get overlaps of any genes whether + or -. But when we ask whether the genes overlap each other, by default findOverlaps will not count an overlap if the ranges are on different strand. To ask how many genes overlap each other ignoring strand, we need to specify ignore.strand=TRUE (which produces more overlaps).
As with SummarizedExperiment, there is a “tidy” way to work with GRanges, called plyranges. For more examples of this type of analysis, take a look at some tutorial content I’ve been working on here: