Here we show off the way we specify genomic ranges in Bioconductor, using the GenomicRanges package and the GRanges class.
A GRanges object specifies both the basepairs involved in the range (these are integer ranges), the strand (+ or -), and the chromosome. Chromosomes in Bioconductor are called “sequences” (because sometimes we may want to specify ranges of non-chromosomes, for example viral DNA, un-incorporated contigs, etc.).
Let’s start with a set of interval ranges, an IRanges object.
library(GenomicRanges)
ir <- IRanges(start=100 + 1:5, width=10)
ir
## IRanges object with 5 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 101 110 10
## [2] 102 111 10
## [3] 103 112 10
## [4] 104 113 10
## [5] 105 114 10
We could also specify the same ranges by setting the end:
ir <- IRanges(start=100 + 1:5, end=100 + 10:14)
ir
## IRanges object with 5 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 101 110 10
## [2] 102 111 10
## [3] 103 112 10
## [4] 104 113 10
## [5] 105 114 10
We can make a plot of these ranges. It’s a bit different than a
typical plot, because I want to point out that we have discrete
locations in the genome (so the position isn’t defined over \(\mathbb{R}\) but over \(\mathbb{N}\)). So I’ve shifted the
positions of the labels by 0.5 (see the axis
command).
Another detail to notice is that the ranges are inclusive. So
the first range, from 101 to 110 includes the 110 basepair. So
I’ve added +1 when I draw the arrows to show it includes the last
basepair.
(Note that the convention used in Bioconductor to define interval ranges is not universal. The BED convention is that the range does not include the start position of the range. So 101-110 in BED format would not include the 101th basepair.)
library(rafalib)
nullplot(start(range(ir)), end(range(ir))+1, 0, 6, xaxt="n")
abline(v=100:115, col="grey")
axis(1, at=100:115 + .5, labels=100:115)
arrows(start(ir),seq_along(ir),end(ir)+1,seq_along(ir),
lwd=3,code=3,angle=90,length=.05)
So far we’ve just shown interval ranges. To specify a genomic range, we also will come up with the strand of the ranges (+ or -) and the chromosome.
st <- rep(c("+","-"),c(3,2))
st
## [1] "+" "+" "+" "-" "-"
gr <- GRanges(seqnames="chr1", ranges=ir, strand=st)
gr
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 101-110 +
## [2] chr1 102-111 +
## [3] chr1 103-112 +
## [4] chr1 104-113 -
## [5] chr1 105-114 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note that GRanges objects, when printed, also have a note
about the genome. Here it says that we have an unspecified genome, and
there is no information about the seqlengths
, the lengths
of the chromosomes. This information (basically in what universe do the
ranges live) is important to track along with the ranges themselves, in
case the user asks to, e.g. move all ranges 1 million basepairs to the
right. This is not necessarily possible if we have a range that ends
less than 1 million basepairs from the end of a chromosome. So the
GRanges object really needs to also contain details about the
genome itself.
Here we show some basic manipulations of GRanges:
gr + 10 # growing by 10 on either side
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 91-120 +
## [2] chr1 92-121 +
## [3] chr1 93-122 +
## [4] chr1 94-123 -
## [5] chr1 95-124 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
shift(gr, 10) # shifting by 10
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 111-120 +
## [2] chr1 112-121 +
## [3] chr1 113-122 +
## [4] chr1 114-123 -
## [5] chr1 115-124 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
resize(gr, width=1) # 1 bp from the start (+) or end (-)
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 101 +
## [2] chr1 102 +
## [3] chr1 103 +
## [4] chr1 113 -
## [5] chr1 114 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
flank(gr, width=2, both=TRUE) # +/- 2 bp from " or "
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 99-102 +
## [2] chr1 100-103 +
## [3] chr1 101-104 +
## [4] chr1 112-115 -
## [5] chr1 113-116 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note that the last two commands, resize
and
flank
, gave back ranges relative to the start of
the ranges with positive strand, and to the end of the ranges
with negative strand. This is particular useful for working with genes,
in which we often care about the region surrounding the
transcription start site or TSS of a gene. The TSS will be the
left-most basepair of a gene on the positive strand, and the right-most
basepair of a gene on the negative strand.
These functions would usually give us more information, but in this case the information was not specified:
seqnames(gr)
## factor-Rle of length 5 with 1 run
## Lengths: 5
## Values : chr1
## Levels(1): chr1
seqinfo(gr)
## Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
## seqnames seqlengths isCircular genome
## chr1 NA NA <NA>
Instead of looking at a toy GRanges, let’s download some actual genomic ranges, the set of human genes, as annotated by the Ensembl project, version 86.
This data is easily accessible in Bioconductor using the ensembldb package, and the specific Ensembl database package from Bioconductor.
This says there are about 60 thousand genes (according to Ensembl).
library(ensembldb)
library(EnsDb.Hsapiens.v86) # this pkg is about 75 Mb
edb <- EnsDb.Hsapiens.v86
g <- genes(edb)
length(g)
## [1] 63970
If you try to print g
it will just show you the
beginning and end:
g
## GRanges object with 63970 ranges and 6 metadata columns:
## seqnames ranges strand | gene_id gene_name
## <Rle> <IRanges> <Rle> | <character> <character>
## ENSG00000223972 1 11869-14409 + | ENSG00000223972 DDX11L1
## ENSG00000227232 1 14404-29570 - | ENSG00000227232 WASH7P
## ENSG00000278267 1 17369-17436 - | ENSG00000278267 MIR6859-1
## ENSG00000243485 1 29554-31109 + | ENSG00000243485 MIR1302-2
## ENSG00000237613 1 34554-36081 - | ENSG00000237613 FAM138A
## ... ... ... ... . ... ...
## ENSG00000224240 Y 26549425-26549743 + | ENSG00000224240 CYCSP49
## ENSG00000227629 Y 26586642-26591601 - | ENSG00000227629 SLC25A15P1
## ENSG00000237917 Y 26594851-26634652 - | ENSG00000237917 PARP4P1
## ENSG00000231514 Y 26626520-26627159 - | ENSG00000231514 FAM58CP
## ENSG00000235857 Y 56855244-56855488 + | ENSG00000235857 CTBP2P1
## gene_biotype seq_coord_system symbol
## <character> <character> <character>
## ENSG00000223972 transcribed_unproces.. chromosome DDX11L1
## ENSG00000227232 unprocessed_pseudogene chromosome WASH7P
## ENSG00000278267 miRNA chromosome MIR6859-1
## ENSG00000243485 lincRNA chromosome MIR1302-2
## ENSG00000237613 lincRNA chromosome FAM138A
## ... ... ... ...
## ENSG00000224240 processed_pseudogene chromosome CYCSP49
## ENSG00000227629 unprocessed_pseudogene chromosome SLC25A15P1
## ENSG00000237917 unprocessed_pseudogene chromosome PARP4P1
## ENSG00000231514 processed_pseudogene chromosome FAM58CP
## ENSG00000235857 processed_pseudogene chromosome CTBP2P1
## entrezid
## <list>
## ENSG00000223972 100287596,100287102,727856,...
## ENSG00000227232 <NA>
## ENSG00000278267 102466751
## ENSG00000243485 105376912,100302278
## ENSG00000237613 654835,645520,641702
## ... ...
## ENSG00000224240 <NA>
## ENSG00000227629 <NA>
## ENSG00000237917 <NA>
## ENSG00000231514 <NA>
## ENSG00000235857 <NA>
## -------
## seqinfo: 357 sequences (1 circular) from GRCh38 genome
You can take a look at a single gene:
g[1]
## GRanges object with 1 range and 6 metadata columns:
## seqnames ranges strand | gene_id gene_name gene_biotype
## <Rle> <IRanges> <Rle> | <character> <character> <character>
## ENSG00000223972 1 11869-14409 + | ENSG00000223972 DDX11L1 transcribed_unproces..
## seq_coord_system symbol entrezid
## <character> <character> <list>
## ENSG00000223972 chromosome DDX11L1 100287596,100287102,727856,...
## -------
## seqinfo: 357 sequences (1 circular) from GRCh38 genome
To pull out the gene names you use a $
. This would give
us a 60 thousand length character vector, so we just show the first
5.
g$gene_name[1:5]
## [1] "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2" "FAM138A"
We can look at the distribution of genes by chromosome. The
seqnames
returns a factor-Rle where Rle
stands for run length encoding. This is a way to compress a
long vector which may contain repeated values in runs.
seqnames(g)
## factor-Rle of length 63970 with 357 runs
## Lengths: 5264 2232 ... 523
## Values : 1 10 ... Y
## Levels(357): 1 10 11 12 13 14 15 16 17 18 ... LRG_187 LRG_239 LRG_311 LRG_721 LRG_741 LRG_93 MT X Y
The object g
has information about genes on the standard
chromosomes, as well as genes or haplotypes of genes on
haplotype chromosomes.
head(seqlevels(g),100)
## [1] "1" "10" "11"
## [4] "12" "13" "14"
## [7] "15" "16" "17"
## [10] "18" "19" "2"
## [13] "20" "21" "22"
## [16] "3" "4" "5"
## [19] "6" "7" "8"
## [22] "9" "CHR_HG107_PATCH" "CHR_HG126_PATCH"
## [25] "CHR_HG1311_PATCH" "CHR_HG1342_HG2282_PATCH" "CHR_HG1362_PATCH"
## [28] "CHR_HG142_HG150_NOVEL_TEST" "CHR_HG151_NOVEL_TEST" "CHR_HG1651_PATCH"
## [31] "CHR_HG1832_PATCH" "CHR_HG2021_PATCH" "CHR_HG2022_PATCH"
## [34] "CHR_HG2023_PATCH" "CHR_HG2030_PATCH" "CHR_HG2058_PATCH"
## [37] "CHR_HG2062_PATCH" "CHR_HG2063_PATCH" "CHR_HG2066_PATCH"
## [40] "CHR_HG2072_PATCH" "CHR_HG2095_PATCH" "CHR_HG2104_PATCH"
## [43] "CHR_HG2116_PATCH" "CHR_HG2128_PATCH" "CHR_HG2191_PATCH"
## [46] "CHR_HG2213_PATCH" "CHR_HG2217_PATCH" "CHR_HG2232_PATCH"
## [49] "CHR_HG2233_PATCH" "CHR_HG2235_PATCH" "CHR_HG2239_PATCH"
## [52] "CHR_HG2247_PATCH" "CHR_HG2249_PATCH" "CHR_HG2288_HG2289_PATCH"
## [55] "CHR_HG2290_PATCH" "CHR_HG2291_PATCH" "CHR_HG2334_PATCH"
## [58] "CHR_HG26_PATCH" "CHR_HG986_PATCH" "CHR_HSCHR10_1_CTG1"
## [61] "CHR_HSCHR10_1_CTG2" "CHR_HSCHR10_1_CTG3" "CHR_HSCHR10_1_CTG4"
## [64] "CHR_HSCHR10_1_CTG6" "CHR_HSCHR11_1_CTG1_2" "CHR_HSCHR11_1_CTG5"
## [67] "CHR_HSCHR11_1_CTG6" "CHR_HSCHR11_1_CTG7" "CHR_HSCHR11_1_CTG8"
## [70] "CHR_HSCHR11_2_CTG1" "CHR_HSCHR11_2_CTG1_1" "CHR_HSCHR11_3_CTG1"
## [73] "CHR_HSCHR12_1_CTG1" "CHR_HSCHR12_1_CTG2_1" "CHR_HSCHR12_2_CTG1"
## [76] "CHR_HSCHR12_2_CTG2" "CHR_HSCHR12_2_CTG2_1" "CHR_HSCHR12_3_CTG2"
## [79] "CHR_HSCHR12_3_CTG2_1" "CHR_HSCHR12_4_CTG2" "CHR_HSCHR12_4_CTG2_1"
## [82] "CHR_HSCHR12_5_CTG2" "CHR_HSCHR12_5_CTG2_1" "CHR_HSCHR12_6_CTG2_1"
## [85] "CHR_HSCHR13_1_CTG1" "CHR_HSCHR13_1_CTG3" "CHR_HSCHR13_1_CTG5"
## [88] "CHR_HSCHR13_1_CTG8" "CHR_HSCHR14_1_CTG1" "CHR_HSCHR14_2_CTG1"
## [91] "CHR_HSCHR14_3_CTG1" "CHR_HSCHR14_7_CTG1" "CHR_HSCHR15_1_CTG1"
## [94] "CHR_HSCHR15_1_CTG3" "CHR_HSCHR15_1_CTG8" "CHR_HSCHR15_2_CTG3"
## [97] "CHR_HSCHR15_2_CTG8" "CHR_HSCHR15_3_CTG3" "CHR_HSCHR15_3_CTG8"
## [100] "CHR_HSCHR15_4_CTG8"
Let’s subset to just the “standard” chromosomes:
g <- keepStandardChromosomes(g, pruning.mode="coarse")
The following functions tell you the names and lengths of chromosomes, as well as the name of the genome (repeated for each chromosome).
seqlevels(g)
## [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "3" "4" "5" "6"
## [20] "7" "8" "9" "MT" "X" "Y"
seqinfo(g)
## Seqinfo object with 25 sequences (1 circular) from GRCh38 genome:
## seqnames seqlengths isCircular genome
## 1 248956422 FALSE GRCh38
## 10 133797422 FALSE GRCh38
## 11 135086622 FALSE GRCh38
## 12 133275309 FALSE GRCh38
## 13 114364328 FALSE GRCh38
## ... ... ... ...
## 8 145138636 FALSE GRCh38
## 9 138394717 FALSE GRCh38
## MT 16569 TRUE GRCh38
## X 156040895 FALSE GRCh38
## Y 57227415 FALSE GRCh38
genome(g)
## 1 10 11 12 13 14 15 16 17 18 19
## "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38"
## 2 20 21 22 3 4 5 6 7 8 9
## "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38" "GRCh38"
## MT X Y
## "GRCh38" "GRCh38" "GRCh38"
We can get the number of genes on each chromosome:
table(seqnames(g))
##
## 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7
## 5264 2232 3289 2969 1316 2244 2171 2538 3039 1181 2962 4024 1403 848 1337 3043 2519 2891 2888 2893
## 8 9 MT X Y
## 2369 2271 37 2399 523
Note that MT (mitochondrial DNA) has many more genes than the other chromosomes, per basepair:
tab <- table(seqnames(g))
tab
##
## 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7
## 5264 2232 3289 2969 1316 2244 2171 2538 3039 1181 2962 4024 1403 848 1337 3043 2519 2891 2888 2893
## 8 9 MT X Y
## 2369 2271 37 2399 523
lens <- seqlengths(seqinfo(g))
lens
## 1 10 11 12 13 14 15 16 17 18
## 248956422 133797422 135086622 133275309 114364328 107043718 101991189 90338345 83257441 80373285
## 19 2 20 21 22 3 4 5 6 7
## 58617616 242193529 64444167 46709983 50818468 198295559 190214555 181538259 170805979 159345973
## 8 9 MT X Y
## 145138636 138394717 16569 156040895 57227415
all.equal(names(tab), names(lens))
## [1] TRUE
barplot(tab/lens)
If we remove MT, we can see the range better:
barplot((tab/lens)[-which(names(tab) == "MT")], horiz=TRUE, las=1)
abline(v=(1:5)*1e-5, col=rgb(0,0,0,.5))
So roughly 1 gene per 50,000 basepairs, averaging per chromosome.
mean((lens/tab)[-which(names(tab) == "MT")])
## [1] 55799.91
Finally, I want to introduce some functionality we will use a couple times later in the course: finding overlaps of sets of genomic ranges. One scenario is that we have two sets and we want to know which ranges from the two sets overlap each other. We might make a query set and ask which genes are overlapping the set in our query.
query <- GRanges(1, IRanges((101:110)*1e6 + 1,width=1e6))
query
## GRanges object with 10 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 101000001-102000000 *
## [2] 1 102000001-103000000 *
## [3] 1 103000001-104000000 *
## [4] 1 104000001-105000000 *
## [5] 1 105000001-106000000 *
## [6] 1 106000001-107000000 *
## [7] 1 107000001-108000000 *
## [8] 1 108000001-109000000 *
## [9] 1 109000001-110000000 *
## [10] 1 110000001-111000000 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We can get the numeric indices of the overlaps using
findOverlaps
. Note that we can turn this resulting object
into a data.frame, for further processing.
fo <- findOverlaps(query, g)
head(fo)
## Hits object with 6 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 2333
## [2] 1 2335
## [3] 1 2336
## [4] 1 2337
## [5] 1 2338
## [6] 1 2339
## -------
## queryLength: 10 / subjectLength: 58650
tail(fo)
## Hits object with 6 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 10 2490
## [2] 10 2491
## [3] 10 2492
## [4] 10 2493
## [5] 10 2494
## [6] 10 2495
## -------
## queryLength: 10 / subjectLength: 58650
# to count the overlaps for each range in query:
table(factor(from(fo), seq_along(query)))
##
## 1 2 3 4 5 6 7 8 9 10
## 17 3 15 2 7 8 6 29 44 36
Equivalently, we can use the countOverlaps
function for
the last command above:
countOverlaps(query, g)
## [1] 17 3 15 2 7 8 6 29 44 36
We can convert the Hits object to a data.frame:
fo <- as.data.frame(fo)
We can also find out which genes overlap which other genes, by just
providing a single set of ranges to findOverlaps
. There are
many overlapping genes in the human genome, though you should
note that we obtain at least two rows in the resulting Hits
object for every overlap, and so we can filter this object down by
insisting that one index be strictly less than the other.
fo <- findOverlaps(g)
head(fo)
## SelfHits object with 6 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 2 2
## [3] 2 3
## [4] 3 2
## [5] 3 3
## [6] 4 4
## -------
## queryLength: 58650 / subjectLength: 58650
fo <- as.data.frame(fo)
fo <- fo[fo[,1] < fo[,2],]
head(fo)
## queryHits subjectHits
## 3 2 3
## 12 9 10
## 20 15 16
## 26 19 20
## 30 21 22
## 37 26 28
nrow(fo)
## [1] 12078
Earlier, when we defined query, we didn’t consider strand,
and so the strand of the ranges was defined as *
. This
means that we get overlaps of any genes whether + or -. But when we ask
whether the genes overlap each other, by default
findOverlaps
will not count an overlap if the ranges are on
different strand. To ask how many genes overlap each other ignoring
strand, we need to specify ignore.strand=TRUE
(which
produces more overlaps).
fo <- findOverlaps(g, ignore.strand=TRUE)
fo <- as.data.frame(fo)
fo <- fo[fo[,1] < fo[,2],]
nrow(fo)
## [1] 29470
If we just wanted to subset to the genes which overlap a given range,
we can use overlapsAny
:
g[overlapsAny(g, query[1])]
## GRanges object with 17 ranges and 6 metadata columns:
## seqnames ranges strand | gene_id gene_name
## <Rle> <IRanges> <Rle> | <character> <character>
## ENSG00000117543 1 100989623-101026088 - | ENSG00000117543 DPH5
## ENSG00000233184 1 101025878-101087268 + | ENSG00000233184 RP11-421L21.3
## ENSG00000252765 1 101133153-101133339 + | ENSG00000252765 SCARNA16
## ENSG00000271578 1 101190520-101190994 + | ENSG00000271578 RP5-1033H2.1
## ENSG00000225938 1 101235683-101236528 - | ENSG00000225938 RP4-575N6.4
## ... ... ... ... . ... ...
## ENSG00000183298 1 101786340-101787219 - | ENSG00000183298 RPSAP19
## ENSG00000118733 1 101802574-101997030 - | ENSG00000118733 OLFM3
## ENSG00000252717 1 101859851-101859957 + | ENSG00000252717 RNU6-352P
## ENSG00000271277 1 101882516-101882894 + | ENSG00000271277 RP11-82O2.2
## ENSG00000162699 1 101893105-101894047 + | ENSG00000162699 DNAJA1P5
## gene_biotype seq_coord_system symbol entrezid
## <character> <character> <character> <list>
## ENSG00000117543 protein_coding chromosome DPH5 51611
## ENSG00000233184 antisense chromosome RP11-421L21.3 <NA>
## ENSG00000252765 scaRNA chromosome SCARNA16 <NA>
## ENSG00000271578 processed_pseudogene chromosome RP5-1033H2.1 <NA>
## ENSG00000225938 antisense chromosome RP4-575N6.4 <NA>
## ... ... ... ... ...
## ENSG00000183298 processed_pseudogene chromosome RPSAP19 <NA>
## ENSG00000118733 protein_coding chromosome OLFM3 118427
## ENSG00000252717 snRNA chromosome RNU6-352P <NA>
## ENSG00000271277 processed_pseudogene chromosome RP11-82O2.2 <NA>
## ENSG00000162699 processed_pseudogene chromosome DNAJA1P5 <NA>
## -------
## seqinfo: 25 sequences (1 circular) from GRCh38 genome
This is equivalent to the following:
g[g %over% query[1]]
## GRanges object with 17 ranges and 6 metadata columns:
## seqnames ranges strand | gene_id gene_name
## <Rle> <IRanges> <Rle> | <character> <character>
## ENSG00000117543 1 100989623-101026088 - | ENSG00000117543 DPH5
## ENSG00000233184 1 101025878-101087268 + | ENSG00000233184 RP11-421L21.3
## ENSG00000252765 1 101133153-101133339 + | ENSG00000252765 SCARNA16
## ENSG00000271578 1 101190520-101190994 + | ENSG00000271578 RP5-1033H2.1
## ENSG00000225938 1 101235683-101236528 - | ENSG00000225938 RP4-575N6.4
## ... ... ... ... . ... ...
## ENSG00000183298 1 101786340-101787219 - | ENSG00000183298 RPSAP19
## ENSG00000118733 1 101802574-101997030 - | ENSG00000118733 OLFM3
## ENSG00000252717 1 101859851-101859957 + | ENSG00000252717 RNU6-352P
## ENSG00000271277 1 101882516-101882894 + | ENSG00000271277 RP11-82O2.2
## ENSG00000162699 1 101893105-101894047 + | ENSG00000162699 DNAJA1P5
## gene_biotype seq_coord_system symbol entrezid
## <character> <character> <character> <list>
## ENSG00000117543 protein_coding chromosome DPH5 51611
## ENSG00000233184 antisense chromosome RP11-421L21.3 <NA>
## ENSG00000252765 scaRNA chromosome SCARNA16 <NA>
## ENSG00000271578 processed_pseudogene chromosome RP5-1033H2.1 <NA>
## ENSG00000225938 antisense chromosome RP4-575N6.4 <NA>
## ... ... ... ... ...
## ENSG00000183298 processed_pseudogene chromosome RPSAP19 <NA>
## ENSG00000118733 protein_coding chromosome OLFM3 118427
## ENSG00000252717 snRNA chromosome RNU6-352P <NA>
## ENSG00000271277 processed_pseudogene chromosome RP11-82O2.2 <NA>
## ENSG00000162699 processed_pseudogene chromosome DNAJA1P5 <NA>
## -------
## seqinfo: 25 sequences (1 circular) from GRCh38 genome
As with SummarizedExperiment, there is a “tidy” way to work with GRanges, called plyranges. For more examples of this type of analysis, take a look at some tutorial content I’ve been working on here:
https://nullranges.github.io/tidy-ranges-tutorial
library(plyranges)
g %>%
group_by(gene_biotype) %>%
summarize(width = mean(width))
## DataFrame with 46 rows and 2 columns
## gene_biotype width
## <character> <numeric>
## 1 3prime_overlapping_n.. 7001.43
## 2 antisense 20910.71
## 3 bidirectional_promot.. 91166.25
## 4 IG_C_gene 2365.29
## 5 IG_C_pseudogene 1012.44
## ... ... ...
## 42 transcribed_unitary_.. 50086.47
## 43 transcribed_unproces.. 28365.94
## 44 unitary_pseudogene 15708.17
## 45 unprocessed_pseudogene 5417.64
## 46 vaultRNA 127.00
Count overlaps of genes in the query ranges:
query %>%
mutate(num_genes = count_overlaps(., g))
## GRanges object with 10 ranges and 1 metadata column:
## seqnames ranges strand | num_genes
## <Rle> <IRanges> <Rle> | <integer>
## [1] 1 101000001-102000000 * | 17
## [2] 1 102000001-103000000 * | 3
## [3] 1 103000001-104000000 * | 15
## [4] 1 104000001-105000000 * | 2
## [5] 1 105000001-106000000 * | 7
## [6] 1 106000001-107000000 * | 8
## [7] 1 107000001-108000000 * | 6
## [8] 1 108000001-109000000 * | 29
## [9] 1 109000001-110000000 * | 44
## [10] 1 110000001-111000000 * | 36
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We can also do the same types of operations as before, with “base GRanges”:
g %>%
slice(1:5) %>%
select(symbol) %>%
anchor_5p() %>%
mutate(width=1) # same as resize(..., width=1)
## GRanges object with 5 ranges and 1 metadata column:
## seqnames ranges strand | symbol
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 1 11869 + | DDX11L1
## ENSG00000227232 1 29570 - | WASH7P
## ENSG00000278267 1 17436 - | MIR6859-1
## ENSG00000243485 1 29554 + | MIR1302-2
## ENSG00000237613 1 36081 - | FAM138A
## -------
## seqinfo: 25 sequences (1 circular) from GRCh38 genome
Plot genes per chromosome:
library(tibble)
library(ggplot2)
g %>%
group_by(seqnames) %>%
summarize(num_genes = n()) %>%
as_tibble() %>%
ggplot(aes(seqnames, num_genes)) +
geom_col()
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] ggplot2_3.3.6 tibble_3.1.7 plyranges_1.17.0
## [4] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.21.2 AnnotationFilter_1.21.0
## [7] GenomicFeatures_1.49.5 AnnotationDbi_1.59.1 Biobase_2.57.1
## [10] rafalib_1.0.0 GenomicRanges_1.49.0 GenomeInfoDb_1.33.3
## [13] IRanges_2.31.0 S4Vectors_0.35.1 BiocGenerics_0.43.0
## [16] pkgdown_2.0.5 testthat_3.1.4 rmarkdown_2.14
## [19] devtools_2.4.3 usethis_2.1.6
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.29.0 matrixStats_0.62.0 bitops_1.0-7
## [4] fs_1.5.2 bit64_4.0.5 filelock_1.0.2
## [7] RColorBrewer_1.1-3 progress_1.2.2 httr_1.4.3
## [10] tools_4.2.1 bslib_0.3.1 utf8_1.2.2
## [13] R6_2.5.1 colorspace_2.0-3 lazyeval_0.2.2
## [16] DBI_1.1.3 withr_2.5.0 tidyselect_1.1.2
## [19] prettyunits_1.1.1 processx_3.6.1 bit_4.0.4
## [22] curl_4.3.2 compiler_4.2.1 cli_3.3.0
## [25] xml2_1.3.3 DelayedArray_0.23.0 labeling_0.4.2
## [28] rtracklayer_1.57.0 sass_0.4.1 scales_1.2.0
## [31] callr_3.7.0 rappdirs_0.3.3 stringr_1.4.0
## [34] digest_0.6.29 Rsamtools_2.13.3 XVector_0.37.0
## [37] pkgconfig_2.0.3 htmltools_0.5.2 sessioninfo_1.2.2
## [40] MatrixGenerics_1.9.1 dbplyr_2.2.1 fastmap_1.1.0
## [43] highr_0.9 rlang_1.0.3 rstudioapi_0.13
## [46] RSQLite_2.2.14 farver_2.1.1 jquerylib_0.1.4
## [49] BiocIO_1.7.1 generics_0.1.3 jsonlite_1.8.0
## [52] BiocParallel_1.31.10 dplyr_1.0.9 RCurl_1.98-1.7
## [55] magrittr_2.0.3 GenomeInfoDbData_1.2.8 Matrix_1.4-1
## [58] munsell_0.5.0 Rcpp_1.0.8.3 fansi_1.0.3
## [61] lifecycle_1.0.1 stringi_1.7.6 yaml_2.3.5
## [64] SummarizedExperiment_1.27.1 zlibbioc_1.43.0 brio_1.1.3
## [67] pkgbuild_1.3.1 BiocFileCache_2.5.0 grid_4.2.1
## [70] blob_1.2.3 parallel_4.2.1 crayon_1.5.1
## [73] lattice_0.20-45 Biostrings_2.65.1 hms_1.1.1
## [76] KEGGREST_1.37.2 knitr_1.39 ps_1.7.1
## [79] pillar_1.7.0 rjson_0.2.21 codetools_0.2-18
## [82] biomaRt_2.53.2 pkgload_1.3.0 XML_3.99-0.10
## [85] glue_1.6.2 evaluate_0.15 remotes_2.4.2
## [88] png_0.1-7 vctrs_0.4.1 gtable_0.3.0
## [91] purrr_0.3.4 assertthat_0.2.1 cachem_1.0.6
## [94] xfun_0.31 restfulr_0.0.15 GenomicAlignments_1.33.0
## [97] memoise_2.0.1 ellipsis_0.3.2