library("AnnotationHub")
<- AnnotationHub() ah
Accessing annotations in Bioconductor
Here we will show how to use AnnotationHub and annotation packages in Bioconductor to access both experimental data, and information about things like genes. AnnotationHub is a relatively new method for accessing experimental data and annotation information (introduced in 2013). Previously, the main ways to access annotation information were to download a large package with all the information contained, or to query a web resource called BioMart using the biomaRt Bioconductor package. AnnotationHub makes it much easier to look up and download annotation information within R.
One important usage note: AnnotationHub creates a hidden folder in your home directory, ~/.AnnotationHub
which accumulates the data objects you download with the package. In the next topic, we will show how to download the human genome which is ~700 Mb, and so you should know that it is sitting in this hidden folder, in case you need to clear up space later.
One way to see what kind of information is available is to put up an HTML browser using the display
function. Search for species: “Homo sapiens” and description: “chip-seq”. (I’ve noticed that using the global search is much slower than using the individual fields.) When you are finished, and want to return to R, you should hit Esc
(or Ctrl-C Ctrl-C
in an R terminal).
display(ah)
We can look up specific datasets of interest, either by selecting them in the web browser and telling the browser to return these rows to R, or by using the single square bracket with the AH
code. This just tells us about the record:
"AH28856"] ah[
AnnotationHub with 1 record
# snapshotDate(): 2024-04-30
# names(): AH28856
# $dataprovider: BroadInstitute
# $species: Homo sapiens
# $rdataclass: GRanges
# $rdatadateadded: 2015-05-06
# $title: E001-H3K4me1.broadPeak.gz
# $description: Broad ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: BED
# $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E001-H3...
# $sourcesize: 7519722
# $tags: c("EpigenomeRoadMap", "peaks", "consolidated", "broadPeak", "E001", "ESC",
# "ESC.I3", "ES-I3 Cells")
# retrieve record with 'object[["AH28856"]]'
If we want to actually download this dataset (a set of ChIP-seq peaks for the histone modification H3K4me1), we use the double square brackets. The data are the set of locations in the genome where the histone modification was detected. Each range is associated with a score, a “signal value”, a p-value and a q-value.
<- ah[["AH28856"]] peaks
loading from cache
require("rtracklayer")
peaks
GRanges object with 355118 ranges and 5 metadata columns:
seqnames ranges strand | name score signalValue pValue
<Rle> <IRanges> <Rle> | <character> <numeric> <numeric> <numeric>
[1] chr3 195423087-195426700 * | Rank_1 78 3.23874 10.67458
[2] chr17 41080552-41081406 * | Rank_2 72 3.70866 9.91354
[3] chr16 65269248-65270757 * | Rank_3 71 4.42401 9.75522
[4] chr10 46973856-46977224 * | Rank_4 70 3.00554 9.70745
[5] chr17 48425051-48426662 * | Rank_5 69 3.91848 9.56466
... ... ... ... . ... ... ... ...
[355114] chr6 100602736-100603151 * | Rank_355114 0 1.46240 1.00757
[355115] chr6 18251768-18251961 * | Rank_355115 0 1.46240 1.00757
[355116] chrX 21444665-21445036 * | Rank_355116 0 1.46240 1.00757
[355117] chr17 77913489-77913735 * | Rank_355117 0 1.41873 1.00674
[355118] chr4 6712348-6712652 * | Rank_355118 0 1.41583 1.00665
qValue
<numeric>
[1] 7.87721
[2] 7.23544
[3] 7.11566
[4] 7.07189
[5] 6.93256
... ...
[355114] 0.04592
[355115] 0.04592
[355116] 0.04592
[355117] 0.04497
[355118] 0.04495
-------
seqinfo: 298 sequences (2 circular) from hg19 genome
We have the chromosome and genome information as well:
seqinfo(peaks)
Seqinfo object with 298 sequences (2 circular) from hg19 genome:
seqnames seqlengths isCircular genome
chr1 249250621 FALSE hg19
chr2 243199373 FALSE hg19
chr3 198022430 FALSE hg19
chr4 191154276 FALSE hg19
chr5 180915260 FALSE hg19
... ... ... ...
chrUn_gl000245 36651 FALSE hg19
chrUn_gl000246 38154 FALSE hg19
chrUn_gl000247 36422 FALSE hg19
chrUn_gl000248 39786 FALSE hg19
chrUn_gl000249 38502 FALSE hg19
Organism databases for mapping gene identifiers
Another type of annotation we can pull down using AnnotationHub is an OrgDb or organism database. We subset the hub to only those objects that are of type OrgDb, and then we query for Homo sapiens.
<- subset(ah, ah$rdataclass == "OrgDb")
orgs <- query(orgs, "Homo sapiens")[[1]] orgdb
loading from cache
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with 'browseVignettes()'. To cite
Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:AnnotationHub':
cache
We can see the OrgDb bundles a lot of information from various sources (Entrez, GO, KEGG, UCSC, Ensembl, Uniprot).
orgdb
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2024-Mar12
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2024-01-17
| GOEGSOURCEDATE: 2024-Mar12
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL:
| GPSOURCEDATE: 2024-Feb29
| ENSOURCEDATE: 2023-Nov22
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Thu Apr 18 21:39:39 2024
Please see: help('select') for usage information
The columns
function tells us what identifiers are contained in the database, where the elements of the database are genes or associated proteins.
columns(orgdb)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
[7] "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GENETYPE" "GO"
[13] "GOALL" "IPI" "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL"
[19] "PATH" "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL"
[25] "UCSCKG" "UNIPROT"
I typically like to use SFTPB as an example gene. If you follow the link you can see it’s expression profile in humans across various tissues. It’s a gene that is almost exclusively expressed in lung, and it plays a role in producing surfactant, which reduces the surface tension in the alveoli in the lung. The lung is like a big balloon, but with many small sacks called alveoli, which produce a massive amount of surface area in a small volume. Without surfactant, we wouldn’t have the muscle power to expand our lungs.
Basic annotation tables with select
We can look up more information about SFTPB using select
. We provide a key, which is the gene symbol, and ask for the gene name, a column of orgdb
. Note that select
returns a data.frame with the key
and the columns
(which could have been more than one):
select(orgdb, keys="SFTPB", columns="GENENAME", keytype="SYMBOL")
'select()' returned 1:1 mapping between keys and columns
SYMBOL GENENAME
1 SFTPB surfactant protein B
Mapping identifiers with mapIds
We can also map across different identifiers, e.g.:
select(orgdb, keys="SFTPB", columns="ENSEMBL", keytype="SYMBOL")
'select()' returned 1:1 mapping between keys and columns
SYMBOL ENSEMBL
1 SFTPB ENSG00000168878
select(orgdb, keys="ENSG00000168878", columns="SYMBOL", keytype="ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
ENSEMBL SYMBOL
1 ENSG00000168878 SFTPB
Sometimes, we don’t want a table but just a vector of IDs. For this, we can use the function mapIds
:
<- c("SFTPB","NR3C1","NFKB1")
k mapIds(orgdb, keys=k,
column="ENSEMBL", keytype="SYMBOL",
multiVals="asNA")
'select()' returned 1:1 mapping between keys and columns
SFTPB NR3C1 NFKB1
"ENSG00000168878" "ENSG00000113580" "ENSG00000109320"
If a key matches to multiple values for column
, then you can choose what to do (see options in ?mapIds
). Above the function will return an NA
, but it is also possible to return the first
element, or a list
instead of a vector, or a CharacterList
etc. You can even define a FUN
function to perform the filtering in case of multiple values.
Gene ontology terms
We can find out what roles a gene play, by looking up the Gene Ontology terms, or GO terms:
<- select(orgdb, "SFTPB", "GO", "SYMBOL") go
'select()' returned 1:many mapping between keys and columns
go
SYMBOL GO EVIDENCE ONTOLOGY
1 SFTPB GO:0005515 IPI MF
2 SFTPB GO:0005576 TAS CC
3 SFTPB GO:0005764 IEA CC
4 SFTPB GO:0005771 IBA CC
5 SFTPB GO:0005789 TAS CC
6 SFTPB GO:0006665 IEA BP
7 SFTPB GO:0007585 IEA BP
8 SFTPB GO:0009887 TAS BP
9 SFTPB GO:0042599 TAS CC
10 SFTPB GO:0045334 TAS CC
11 SFTPB GO:0097208 IBA CC
12 SFTPB GO:0097486 TAS CC
These are not informative alone, just, e.g. GO:0001664
. Let’s restict to the terms that represent “biological processes”:
<- go[go$ONTOLOGY == "BP",] # biological processes
go go
SYMBOL GO EVIDENCE ONTOLOGY
6 SFTPB GO:0006665 IEA BP
7 SFTPB GO:0007585 IEA BP
8 SFTPB GO:0009887 TAS BP
Now we need to look up what the GO terms mean, using a separate database, called GO.db
:
library(GO.db)
columns(GO.db)
[1] "DEFINITION" "GOID" "ONTOLOGY" "TERM"
<- select(GO.db, go$GO, c("TERM","DEFINITION"), "GOID") go2
'select()' returned 1:1 mapping between keys and columns
$TERM go2
[1] "sphingolipid metabolic process"
[2] "respiratory gaseous exchange by respiratory system"
[3] "animal organ morphogenesis"
As expected, one of the terms has to do with respiratory gaseous exchange. Also, the surfactant is made up of many lipids, so the first term is not surprising. Sometimes very generic categories show up here, so it’s relevant to know how many genes a GO term is associated with. Here we will use the Ensembl IDs for genes.
sapply(go$GO, function(term) nrow(select(orgdb, term, "ENSEMBL", "GO")))
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
GO:0006665 GO:0007585 GO:0009887
29 30 121
Another gene I often use as an example is the glucocorticoid receptor, which is the protein that responds to the hormone cortisol, the “stress hormone”. We can repeat these lines of code with the gene symbol for glucocorticoid receptor (see the alias “GR”):
select(orgdb, "NR3C1", "GENENAME", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
SYMBOL GENENAME
1 NR3C1 nuclear receptor subfamily 3 group C member 1
select(orgdb, "NR3C1", "ALIAS", "SYMBOL")
'select()' returned 1:many mapping between keys and columns
SYMBOL ALIAS
1 NR3C1 GCCR
2 NR3C1 GCR
3 NR3C1 GCRST
4 NR3C1 GR
5 NR3C1 GRL
6 NR3C1 NR3C1
<- select(orgdb, "NR3C1", "GO", "SYMBOL") go
'select()' returned 1:many mapping between keys and columns
<- go[go$ONTOLOGY == "BP",] # biological processes
go <- select(GO.db, go$GO, c("TERM","DEFINITION"), "GOID") go2
'select()' returned many:1 mapping between keys and columns
$TERM go2
[1] "negative regulation of transcription by RNA polymerase II"
[2] "negative regulation of transcription by RNA polymerase II"
[3] "regulation of gluconeogenesis"
[4] "chromatin organization"
[5] "regulation of DNA-templated transcription"
[6] "regulation of transcription by RNA polymerase II"
[7] "apoptotic process"
[8] "chromosome segregation"
[9] "signal transduction"
[10] "glucocorticoid metabolic process"
[11] "gene expression"
[12] "microglia differentiation"
[13] "adrenal gland development"
[14] "intracellular steroid hormone receptor signaling pathway"
[15] "regulation of glucocorticoid biosynthetic process"
[16] "synaptic transmission, glutamatergic"
[17] "maternal behavior"
[18] "glucocorticoid receptor signaling pathway"
[19] "glucocorticoid mediated signaling pathway"
[20] "positive regulation of neuron apoptotic process"
[21] "negative regulation of DNA-templated transcription"
[22] "positive regulation of transcription by RNA polymerase II"
[23] "positive regulation of transcription by RNA polymerase II"
[24] "astrocyte differentiation"
[25] "cell division"
[26] "nucleus localization"
[27] "mammary gland duct morphogenesis"
[28] "motor behavior"
[29] "cellular response to steroid hormone stimulus"
[30] "cellular response to glucocorticoid stimulus"
[31] "cellular response to dexamethasone stimulus"
[32] "cellular response to transforming growth factor beta stimulus"
[33] "neuroinflammatory response"
[34] "positive regulation of miRNA transcription"
Annotation-based consequence of gene isoforms
As we showed in the GRangesList lecture note how genes can have different isoforms (or “transcripts”), there are also annotation resources for understanding the differences between these isoforms. A package IsoformSwitchAnalyzeR provides a pipeline for analyzing the consequence of changes between isoforms.
In particular see the section of the vignette, Predicting switch consequences.
This package provides a function called switchPlot
, that shows:
The isoform structures along with the concatenated annotations (including transcript classification, ORF, Coding Potential, NMD sensitivity, annotated protein domains as well as annotated signal peptides)
Gene and isoform expression and any annotated differential expression analysis of these
Isoform usage including the result of the isoform switch test
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] GO.db_3.19.1 AnnotationDbi_1.66.0 Biobase_2.64.0 rtracklayer_1.64.0
[5] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1 IRanges_2.38.1 S4Vectors_0.42.1
[9] AnnotationHub_3.12.0 BiocFileCache_2.12.0 dbplyr_2.5.0 BiocGenerics_0.50.0
[13] testthat_3.2.1.1 rmarkdown_2.27 devtools_2.4.5 usethis_3.0.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 bitops_1.0-8 remotes_2.5.0
[4] rlang_1.1.4 magrittr_2.0.3 matrixStats_1.3.0
[7] compiler_4.4.1 RSQLite_2.3.7 png_0.1-8
[10] vctrs_0.6.5 stringr_1.5.1 profvis_0.3.8
[13] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
[16] XVector_0.44.0 ellipsis_0.3.2 utf8_1.2.4
[19] Rsamtools_2.20.0 promises_1.3.0 sessioninfo_1.2.2
[22] UCSC.utils_1.0.0 purrr_1.0.2 bit_4.0.5
[25] xfun_0.46 zlibbioc_1.50.0 cachem_1.1.0
[28] jsonlite_1.8.8 blob_1.2.4 later_1.3.2
[31] DelayedArray_0.30.1 BiocParallel_1.38.0 parallel_4.4.1
[34] R6_2.5.1 stringi_1.8.4 pkgload_1.4.0
[37] brio_1.1.5 Rcpp_1.0.13 SummarizedExperiment_1.34.0
[40] knitr_1.48 httpuv_1.6.15 Matrix_1.7-0
[43] tidyselect_1.2.1 rstudioapi_0.16.0 abind_1.4-5
[46] yaml_2.3.10 codetools_0.2-20 miniUI_0.1.1.1
[49] curl_5.2.1 pkgbuild_1.4.4 lattice_0.22-6
[52] tibble_3.2.1 shiny_1.9.1 withr_3.0.1
[55] KEGGREST_1.44.1 evaluate_0.24.0 urlchecker_1.0.1
[58] Biostrings_2.72.1 pillar_1.9.0 BiocManager_1.30.23
[61] filelock_1.0.3 MatrixGenerics_1.16.0 generics_0.1.3
[64] RCurl_1.98-1.16 BiocVersion_3.19.1 xtable_1.8-4
[67] glue_1.7.0 tools_4.4.1 BiocIO_1.14.0
[70] BSgenome_1.72.0 GenomicAlignments_1.40.0 fs_1.6.4
[73] XML_3.99-0.17 grid_4.4.1 GenomeInfoDbData_1.2.12
[76] restfulr_0.0.15 cli_3.6.3 rappdirs_0.3.3
[79] fansi_1.0.6 S4Arrays_1.4.1 dplyr_1.1.4
[82] digest_0.6.36 SparseArray_1.4.8 rjson_0.2.21
[85] htmlwidgets_1.6.4 memoise_2.0.1 htmltools_0.5.8.1
[88] lifecycle_1.0.4 httr_1.4.7 mime_0.12
[91] bit64_4.0.5