Here we will show how to use AnnotationHub and annotation packages in Bioconductor to access both experimental data, and information about things like genes. AnnotationHub is a relatively new method for accessing experimental data and annotation information (introduced in 2013). Previously, the main ways to access annotation information were to download a large package with all the information contained, or to query a web resource called BioMart using the biomaRt Bioconductor package. AnnotationHub makes it much easier to look up and download annotation information within R.
One important usage note: AnnotationHub creates a hidden
folder in your home directory, ~/.AnnotationHub
which
accumulates the data objects you download with the package. In the next
topic, we will show how to download the human genome which is ~700 Mb,
and so you should know that it is sitting in this hidden folder, in case
you need to clear up space later.
library("AnnotationHub")
ah <- AnnotationHub()
One way to see what kind of information is available is to put up an
HTML browser using the display
function. Search for
species: “Homo sapiens” and
description: “chip-seq”. (I’ve noticed that using the
global search is much slower than using the individual fields.) When you
are finished, and want to return to R, you should hit Esc
(or Ctrl-C Ctrl-C
in an R terminal).
display(ah)
We can look up specific datasets of interest, either by selecting
them in the web browser and telling the browser to return these rows to
R, or by using the single square bracket with the AH
code.
This just tells us about the record:
ah["AH28856"]
## AnnotationHub with 1 record
## # snapshotDate(): 2022-08-12
## # names(): AH28856
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2015-05-06
## # $title: E001-H3K4me1.broadPeak.gz
## # $description: Broad ChIP-seq peaks for consolidated epigenomes from E...
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: BED
## # $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/peaks/conso...
## # $sourcesize: 7519722
## # $tags: c("EpigenomeRoadMap", "peaks", "consolidated",
## # "broadPeak", "E001", "ESC", "ESC.I3", "ES-I3 Cells")
## # retrieve record with 'object[["AH28856"]]'
If we want to actually download this dataset (a set of ChIP-seq peaks for the histone modification H3K4me1), we use the double square brackets. The data are the set of locations in the genome where the histone modification was detected. Each range is associated with a score, a “signal value”, a p-value and a q-value.
peaks <- ah[["AH28856"]]
## loading from cache
peaks
## GRanges object with 355118 ranges and 5 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr3 195423087-195426700 * | Rank_1 78
## [2] chr17 41080552-41081406 * | Rank_2 72
## [3] chr16 65269248-65270757 * | Rank_3 71
## [4] chr10 46973856-46977224 * | Rank_4 70
## [5] chr17 48425051-48426662 * | Rank_5 69
## ... ... ... ... . ... ...
## [355114] chr6 100602736-100603151 * | Rank_355114 0
## [355115] chr6 18251768-18251961 * | Rank_355115 0
## [355116] chrX 21444665-21445036 * | Rank_355116 0
## [355117] chr17 77913489-77913735 * | Rank_355117 0
## [355118] chr4 6712348-6712652 * | Rank_355118 0
## signalValue pValue qValue
## <numeric> <numeric> <numeric>
## [1] 3.23874 10.67458 7.87721
## [2] 3.70866 9.91354 7.23544
## [3] 4.42401 9.75522 7.11566
## [4] 3.00554 9.70745 7.07189
## [5] 3.91848 9.56466 6.93256
## ... ... ... ...
## [355114] 1.46240 1.00757 0.04592
## [355115] 1.46240 1.00757 0.04592
## [355116] 1.46240 1.00757 0.04592
## [355117] 1.41873 1.00674 0.04497
## [355118] 1.41583 1.00665 0.04495
## -------
## seqinfo: 298 sequences (2 circular) from hg19 genome
We have the chromosome and genome information as well:
seqinfo(peaks)
## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
## seqnames seqlengths isCircular genome
## chr1 249250621 FALSE hg19
## chr2 243199373 FALSE hg19
## chr3 198022430 FALSE hg19
## chr4 191154276 FALSE hg19
## chr5 180915260 FALSE hg19
## ... ... ... ...
## chrUn_gl000245 36651 FALSE hg19
## chrUn_gl000246 38154 FALSE hg19
## chrUn_gl000247 36422 FALSE hg19
## chrUn_gl000248 39786 FALSE hg19
## chrUn_gl000249 38502 FALSE hg19
Another type of annotation we can pull down using AnnotationHub is an OrgDb or organism database. We subset the hub to only those objects that are of type OrgDb, and then we query for Homo sapiens.
orgs <- subset(ah, ah$rdataclass == "OrgDb")
orgdb <- query(orgs, "Homo sapiens")[[1]]
## loading from cache
We can see the OrgDb bundles a lot of information from various sources (Entrez, GO, KEGG, UCSC, Ensembl, Uniprot).
orgdb
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2022-Mar17
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2022-03-10
## | GOEGSOURCEDATE: 2022-Mar17
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL:
## | GPSOURCEDATE: 2022-Nov23
## | ENSOURCEDATE: 2021-Dec21
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Fri Apr 1 14:42:16 2022
##
## Please see: help('select') for usage information
The columns
function tells us what identifiers are
contained in the database, where the elements of the database are genes
or associated proteins.
columns(orgdb)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GENETYPE" "GO"
## [13] "GOALL" "IPI" "MAP" "OMIM"
## [17] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL"
## [25] "UCSCKG" "UNIPROT"
I typically like to use SFTPB as an example gene. If you follow the link you can see it’s expression profile in humans across various tissues. It’s a gene that is almost exclusively expressed in lung, and it plays a role in producing surfactant, which reduces the surface tension in the alveoli in the lung. The lung is like a big balloon, but with many small sacks called alveoli, which produce a massive amount of surface area in a small volume. Without surfactant, we wouldn’t have the muscle power to expand our lungs.
select
We can look up more information about SFTPB using
select
. We provide a key, which is the gene symbol, and ask
for the gene name, a column of orgdb
. Note that
select
returns a data.frame with the key
and
the columns
(which could have been more than one):
select(orgdb, keys="SFTPB", columns="GENENAME", keytype="SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL GENENAME
## 1 SFTPB surfactant protein B
mapIds
We can also map across different identifiers, e.g.:
select(orgdb, keys="SFTPB", columns="ENSEMBL", keytype="SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL ENSEMBL
## 1 SFTPB ENSG00000168878
select(orgdb, keys="ENSG00000168878", columns="SYMBOL", keytype="ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
## ENSEMBL SYMBOL
## 1 ENSG00000168878 SFTPB
Sometimes, we don’t want a table but just a vector of IDs. For this,
we can use the function mapIds
:
k <- c("SFTPB","NR3C1","NFKB1")
mapIds(orgdb, keys=k,
column="ENSEMBL", keytype="SYMBOL",
multiVals="asNA")
## 'select()' returned 1:1 mapping between keys and columns
## SFTPB NR3C1 NFKB1
## "ENSG00000168878" "ENSG00000113580" "ENSG00000109320"
If a key matches to multiple values for column
, then you
can choose what to do (see options in ?mapIds
). Above the
function will return an NA
, but it is also possible to
return the first
element, or a list
instead of
a vector, or a CharacterList
etc. You can even define a
FUN
function to perform the filtering in case of multiple
values.
We can find out what roles a gene play, by looking up the Gene Ontology terms, or GO terms:
go <- select(orgdb, "SFTPB", "GO", "SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
go
## SYMBOL GO EVIDENCE ONTOLOGY
## 1 SFTPB GO:0005515 IPI MF
## 2 SFTPB GO:0005576 TAS CC
## 3 SFTPB GO:0005615 IBA CC
## 4 SFTPB GO:0005764 IEA CC
## 5 SFTPB GO:0005771 IBA CC
## 6 SFTPB GO:0005789 TAS CC
## 7 SFTPB GO:0006665 IEA BP
## 8 SFTPB GO:0007585 IEA BP
## 9 SFTPB GO:0009887 TAS BP
## 10 SFTPB GO:0042599 TAS CC
## 11 SFTPB GO:0045334 TAS CC
## 12 SFTPB GO:0097208 IBA CC
## 13 SFTPB GO:0097486 TAS CC
These are not informative alone, just, e.g. GO:0001664
.
Let’s restict to the terms that represent “biological processes”:
go <- go[go$ONTOLOGY == "BP",] # biological processes
go
## SYMBOL GO EVIDENCE ONTOLOGY
## 7 SFTPB GO:0006665 IEA BP
## 8 SFTPB GO:0007585 IEA BP
## 9 SFTPB GO:0009887 TAS BP
Now we need to look up what the GO terms mean, using a separate
database, called GO.db
:
library(GO.db)
columns(GO.db)
## [1] "DEFINITION" "GOID" "ONTOLOGY" "TERM"
go2 <- select(GO.db, go$GO, c("TERM","DEFINITION"), "GOID")
## 'select()' returned 1:1 mapping between keys and columns
go2$TERM
## [1] "sphingolipid metabolic process"
## [2] "respiratory gaseous exchange by respiratory system"
## [3] "animal organ morphogenesis"
As expected, one of the terms has to do with respiratory gaseous exchange. Also, the surfactant is made up of many lipids, so the first term is not surprising. Sometimes very generic categories show up here, so it’s relevant to know how many genes a GO term is associated with. Here we will use the Ensembl IDs for genes.
sapply(go$GO, function(term) nrow(select(orgdb, term, "ENSEMBL", "GO")))
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## GO:0006665 GO:0007585 GO:0009887
## 25 29 149
Another gene I often use as an example is the glucocorticoid receptor, which is the protein that responds to the hormone cortisol, the “stress hormone”. We can repeat these lines of code with the gene symbol for glucocorticoid receptor (see the alias “GR”):
select(orgdb, "NR3C1", "GENENAME", "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL GENENAME
## 1 NR3C1 nuclear receptor subfamily 3 group C member 1
select(orgdb, "NR3C1", "ALIAS", "SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
## SYMBOL ALIAS
## 1 NR3C1 GCCR
## 2 NR3C1 GCR
## 3 NR3C1 GCRST
## 4 NR3C1 GR
## 5 NR3C1 GRL
## 6 NR3C1 NR3C1
go <- select(orgdb, "NR3C1", "GO", "SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
go <- go[go$ONTOLOGY == "BP",] # biological processes
go2 <- select(GO.db, go$GO, c("TERM","DEFINITION"), "GOID")
## 'select()' returned many:1 mapping between keys and columns
go2$TERM
## [1] "negative regulation of transcription by RNA polymerase II"
## [2] "negative regulation of transcription by RNA polymerase II"
## [3] "chromatin organization"
## [4] "regulation of transcription, DNA-templated"
## [5] "regulation of transcription by RNA polymerase II"
## [6] "apoptotic process"
## [7] "cell cycle"
## [8] "chromosome segregation"
## [9] "signal transduction"
## [10] "intracellular steroid hormone receptor signaling pathway"
## [11] "glucocorticoid receptor signaling pathway"
## [12] "glucocorticoid mediated signaling pathway"
## [13] "negative regulation of transcription, DNA-templated"
## [14] "positive regulation of transcription by RNA polymerase II"
## [15] "positive regulation of transcription by RNA polymerase II"
## [16] "cell division"
## [17] "cellular response to steroid hormone stimulus"
## [18] "cellular response to glucocorticoid stimulus"
## [19] "cellular response to dexamethasone stimulus"
## [20] "cellular response to transforming growth factor beta stimulus"
## [21] "positive regulation of miRNA transcription"
As we showed in the GRangesList lecture note how genes can have different isoforms (or “transcripts”), there are also annotation resources for understanding the differences between these isoforms. A package IsoformSwitchAnalyzeR provides a pipeline for analyzing the consequence of changes between isoforms.
In particular see the section of the vignette, Predicting switch consequences.
This package provides a function called switchPlot
, that
shows:
The isoform structures along with the concatenated annotations (including transcript classification, ORF, Coding Potential, NMD sensitivity, annotated protein domains as well as annotated signal peptides)
Gene and isoform expression and any annotated differential expression analysis of these
Isoform usage including the result of the isoform switch test
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 11.6.8
##
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] GO.db_3.15.0 rtracklayer_1.57.0 GenomicRanges_1.49.0
## [4] GenomeInfoDb_1.33.3 AnnotationHub_3.5.0 BiocFileCache_2.5.0
## [7] dbplyr_2.2.1 AnnotationDbi_1.59.1 IRanges_2.31.0
## [10] S4Vectors_0.35.1 Biobase_2.57.1 BiocGenerics_0.43.0
## [13] pkgdown_2.0.5 testthat_3.1.4 rmarkdown_2.14
## [16] devtools_2.4.3 usethis_2.1.6
##
## loaded via a namespace (and not attached):
## [1] matrixStats_0.62.0 bitops_1.0-7
## [3] fs_1.5.2 bit64_4.0.5
## [5] filelock_1.0.2 httr_1.4.3
## [7] tools_4.2.1 bslib_0.3.1
## [9] utf8_1.2.2 R6_2.5.1
## [11] DBI_1.1.3 tidyselect_1.1.2
## [13] prettyunits_1.1.1 processx_3.6.1
## [15] bit_4.0.4 curl_4.3.2
## [17] compiler_4.2.1 cli_3.3.0
## [19] DelayedArray_0.23.0 sass_0.4.1
## [21] callr_3.7.0 rappdirs_0.3.3
## [23] stringr_1.4.0 digest_0.6.29
## [25] Rsamtools_2.13.3 XVector_0.37.0
## [27] pkgconfig_2.0.3 htmltools_0.5.2
## [29] sessioninfo_1.2.2 MatrixGenerics_1.9.1
## [31] BSgenome_1.65.2 fastmap_1.1.0
## [33] rlang_1.0.3 rstudioapi_0.13
## [35] RSQLite_2.2.14 shiny_1.7.1
## [37] jquerylib_0.1.4 BiocIO_1.7.1
## [39] generics_0.1.3 jsonlite_1.8.0
## [41] BiocParallel_1.31.10 dplyr_1.0.9
## [43] RCurl_1.98-1.7 magrittr_2.0.3
## [45] GenomeInfoDbData_1.2.8 Matrix_1.4-1
## [47] Rcpp_1.0.8.3 fansi_1.0.3
## [49] lifecycle_1.0.1 stringi_1.7.6
## [51] yaml_2.3.5 SummarizedExperiment_1.27.1
## [53] zlibbioc_1.43.0 brio_1.1.3
## [55] pkgbuild_1.3.1 grid_4.2.1
## [57] blob_1.2.3 parallel_4.2.1
## [59] promises_1.2.0.1 crayon_1.5.1
## [61] lattice_0.20-45 Biostrings_2.65.1
## [63] KEGGREST_1.37.2 knitr_1.39
## [65] ps_1.7.1 pillar_1.7.0
## [67] rjson_0.2.21 codetools_0.2-18
## [69] pkgload_1.3.0 XML_3.99-0.10
## [71] glue_1.6.2 BiocVersion_3.16.0
## [73] evaluate_0.15 remotes_2.4.2
## [75] BiocManager_1.30.18 png_0.1-7
## [77] vctrs_0.4.1 httpuv_1.6.5
## [79] purrr_0.3.4 assertthat_0.2.1
## [81] cachem_1.0.6 xfun_0.31
## [83] mime_0.12 xtable_1.8-4
## [85] restfulr_0.0.15 later_1.3.0
## [87] tibble_3.1.7 GenomicAlignments_1.33.0
## [89] memoise_2.0.1 ellipsis_0.3.2
## [91] interactiveDisplayBase_1.35.0