Accessing annotations in Bioconductor

Author

Here we will show how to use AnnotationHub and annotation packages in Bioconductor to access both experimental data, and information about things like genes. AnnotationHub is a relatively new method for accessing experimental data and annotation information (introduced in 2013). Previously, the main ways to access annotation information were to download a large package with all the information contained, or to query a web resource called BioMart using the biomaRt Bioconductor package. AnnotationHub makes it much easier to look up and download annotation information within R.

One important usage note: AnnotationHub creates a hidden folder in your home directory, ~/.AnnotationHub which accumulates the data objects you download with the package. In the next topic, we will show how to download the human genome which is ~700 Mb, and so you should know that it is sitting in this hidden folder, in case you need to clear up space later.

library("AnnotationHub")
ah <- AnnotationHub()

One way to see what kind of information is available is to put up an HTML browser using the display function. Search for species: “Homo sapiens” and description: “chip-seq”. (I’ve noticed that using the global search is much slower than using the individual fields.) When you are finished, and want to return to R, you should hit Esc (or Ctrl-C Ctrl-C in an R terminal).

display(ah)

We can look up specific datasets of interest, either by selecting them in the web browser and telling the browser to return these rows to R, or by using the single square bracket with the AH code. This just tells us about the record:

ah["AH28856"]

AnnotationHub with 1 record
# snapshotDate(): 2024-04-30
# names(): AH28856
# $dataprovider: BroadInstitute
# $species: Homo sapiens
# $rdataclass: GRanges
# $rdatadateadded: 2015-05-06
# $title: E001-H3K4me1.broadPeak.gz
# $description: Broad ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: BED
# $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E001-H3...
# $sourcesize: 7519722
# $tags: c("EpigenomeRoadMap", "peaks", "consolidated", "broadPeak", "E001", "ESC",
#   "ESC.I3", "ES-I3 Cells") 
# retrieve record with 'object[["AH28856"]]'

If we want to actually download this dataset (a set of ChIP-seq peaks for the histone modification H3K4me1), we use the double square brackets. The data are the set of locations in the genome where the histone modification was detected. Each range is associated with a score, a “signal value”, a p-value and a q-value.

peaks <- ah[["AH28856"]]

loading from cache

require("rtracklayer")

peaks

GRanges object with 355118 ranges and 5 metadata columns:
           seqnames              ranges strand |        name     score signalValue    pValue
              <Rle>           <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric>
       [1]     chr3 195423087-195426700      * |      Rank_1        78     3.23874  10.67458
       [2]    chr17   41080552-41081406      * |      Rank_2        72     3.70866   9.91354
       [3]    chr16   65269248-65270757      * |      Rank_3        71     4.42401   9.75522
       [4]    chr10   46973856-46977224      * |      Rank_4        70     3.00554   9.70745
       [5]    chr17   48425051-48426662      * |      Rank_5        69     3.91848   9.56466
       ...      ...                 ...    ... .         ...       ...         ...       ...
  [355114]     chr6 100602736-100603151      * | Rank_355114         0     1.46240   1.00757
  [355115]     chr6   18251768-18251961      * | Rank_355115         0     1.46240   1.00757
  [355116]     chrX   21444665-21445036      * | Rank_355116         0     1.46240   1.00757
  [355117]    chr17   77913489-77913735      * | Rank_355117         0     1.41873   1.00674
  [355118]     chr4     6712348-6712652      * | Rank_355118         0     1.41583   1.00665
              qValue
           <numeric>
       [1]   7.87721
       [2]   7.23544
       [3]   7.11566
       [4]   7.07189
       [5]   6.93256
       ...       ...
  [355114]   0.04592
  [355115]   0.04592
  [355116]   0.04592
  [355117]   0.04497
  [355118]   0.04495
  -------
  seqinfo: 298 sequences (2 circular) from hg19 genome

We have the chromosome and genome information as well:

seqinfo(peaks)

Seqinfo object with 298 sequences (2 circular) from hg19 genome:
  seqnames       seqlengths isCircular genome
  chr1            249250621      FALSE   hg19
  chr2            243199373      FALSE   hg19
  chr3            198022430      FALSE   hg19
  chr4            191154276      FALSE   hg19
  chr5            180915260      FALSE   hg19
  ...                   ...        ...    ...
  chrUn_gl000245      36651      FALSE   hg19
  chrUn_gl000246      38154      FALSE   hg19
  chrUn_gl000247      36422      FALSE   hg19
  chrUn_gl000248      39786      FALSE   hg19
  chrUn_gl000249      38502      FALSE   hg19

Organism databases for mapping gene identifiers

Another type of annotation we can pull down using AnnotationHub is an OrgDb or organism database. We subset the hub to only those objects that are of type OrgDb, and then we query for Homo sapiens.

orgs <- subset(ah, ah$rdataclass == "OrgDb")
orgdb <- query(orgs, "Homo sapiens")[[1]]

loading from cache

Loading required package: AnnotationDbi

Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite
    Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: 'Biobase'

The following object is masked from 'package:AnnotationHub':

    cache

We can see the OrgDb bundles a lot of information from various sources (Entrez, GO, KEGG, UCSC, Ensembl, Uniprot).

orgdb

OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2024-Mar12
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2024-01-17
| GOEGSOURCEDATE: 2024-Mar12
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL: 
| GPSOURCEDATE: 2024-Feb29
| ENSOURCEDATE: 2023-Nov22
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Thu Apr 18 21:39:39 2024


Please see: help('select') for usage information

The columns function tells us what identifiers are contained in the database, where the elements of the database are genes or associated proteins.

columns(orgdb)

 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
 [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GENETYPE"     "GO"          
[13] "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL" 
[19] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"      
[25] "UCSCKG"       "UNIPROT"

I typically like to use SFTPB as an example gene. If you follow the link you can see it’s expression profile in humans across various tissues. It’s a gene that is almost exclusively expressed in lung, and it plays a role in producing surfactant, which reduces the surface tension in the alveoli in the lung. The lung is like a big balloon, but with many small sacks called alveoli, which produce a massive amount of surface area in a small volume. Without surfactant, we wouldn’t have the muscle power to expand our lungs.

Basic annotation tables with `select`

We can look up more information about SFTPB using select. We provide a key, which is the gene symbol, and ask for the gene name, a column of orgdb. Note that select returns a data.frame with the key and the columns (which could have been more than one):

select(orgdb, keys="SFTPB", columns="GENENAME", keytype="SYMBOL")

'select()' returned 1:1 mapping between keys and columns

  SYMBOL             GENENAME
1  SFTPB surfactant protein B

Mapping identifiers with `mapIds`

We can also map across different identifiers, e.g.:

select(orgdb, keys="SFTPB", columns="ENSEMBL", keytype="SYMBOL")

'select()' returned 1:1 mapping between keys and columns

  SYMBOL         ENSEMBL
1  SFTPB ENSG00000168878

select(orgdb, keys="ENSG00000168878", columns="SYMBOL", keytype="ENSEMBL")

'select()' returned 1:1 mapping between keys and columns

          ENSEMBL SYMBOL
1 ENSG00000168878  SFTPB

Sometimes, we don’t want a table but just a vector of IDs. For this, we can use the function mapIds:

k <- c("SFTPB","NR3C1","NFKB1")
mapIds(orgdb, keys=k,
       column="ENSEMBL", keytype="SYMBOL",
       multiVals="asNA")

'select()' returned 1:1 mapping between keys and columns

            SFTPB             NR3C1             NFKB1 
"ENSG00000168878" "ENSG00000113580" "ENSG00000109320"

If a key matches to multiple values for column, then you can choose what to do (see options in ?mapIds). Above the function will return an NA, but it is also possible to return the first element, or a list instead of a vector, or a CharacterList etc. You can even define a FUN function to perform the filtering in case of multiple values.

Gene ontology terms

We can find out what roles a gene play, by looking up the Gene Ontology terms, or GO terms:

go <- select(orgdb, "SFTPB", "GO", "SYMBOL")

'select()' returned 1:many mapping between keys and columns

go

   SYMBOL         GO EVIDENCE ONTOLOGY
1   SFTPB GO:0005515      IPI       MF
2   SFTPB GO:0005576      TAS       CC
3   SFTPB GO:0005764      IEA       CC
4   SFTPB GO:0005771      IBA       CC
5   SFTPB GO:0005789      TAS       CC
6   SFTPB GO:0006665      IEA       BP
7   SFTPB GO:0007585      IEA       BP
8   SFTPB GO:0009887      TAS       BP
9   SFTPB GO:0042599      TAS       CC
10  SFTPB GO:0045334      TAS       CC
11  SFTPB GO:0097208      IBA       CC
12  SFTPB GO:0097486      TAS       CC

These are not informative alone, just, e.g. GO:0001664. Let’s restict to the terms that represent “biological processes”:

go <- go[go$ONTOLOGY == "BP",] # biological processes
go

  SYMBOL         GO EVIDENCE ONTOLOGY
6  SFTPB GO:0006665      IEA       BP
7  SFTPB GO:0007585      IEA       BP
8  SFTPB GO:0009887      TAS       BP

Now we need to look up what the GO terms mean, using a separate database, called GO.db:

library(GO.db)

columns(GO.db)

[1] "DEFINITION" "GOID"       "ONTOLOGY"   "TERM"

go2 <- select(GO.db, go$GO, c("TERM","DEFINITION"), "GOID")

'select()' returned 1:1 mapping between keys and columns

go2$TERM

[1] "sphingolipid metabolic process"                    
[2] "respiratory gaseous exchange by respiratory system"
[3] "animal organ morphogenesis"

As expected, one of the terms has to do with respiratory gaseous exchange. Also, the surfactant is made up of many lipids, so the first term is not surprising. Sometimes very generic categories show up here, so it’s relevant to know how many genes a GO term is associated with. Here we will use the Ensembl IDs for genes.

sapply(go$GO, function(term) nrow(select(orgdb, term, "ENSEMBL", "GO")))

'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns

GO:0006665 GO:0007585 GO:0009887 
        29         30        121

Another gene I often use as an example is the glucocorticoid receptor, which is the protein that responds to the hormone cortisol, the “stress hormone”. We can repeat these lines of code with the gene symbol for glucocorticoid receptor (see the alias “GR”):

select(orgdb, "NR3C1", "GENENAME", "SYMBOL")

'select()' returned 1:1 mapping between keys and columns

  SYMBOL                                      GENENAME
1  NR3C1 nuclear receptor subfamily 3 group C member 1

select(orgdb, "NR3C1", "ALIAS", "SYMBOL")

'select()' returned 1:many mapping between keys and columns

  SYMBOL ALIAS
1  NR3C1  GCCR
2  NR3C1   GCR
3  NR3C1 GCRST
4  NR3C1    GR
5  NR3C1   GRL
6  NR3C1 NR3C1

go <- select(orgdb, "NR3C1", "GO", "SYMBOL")

'select()' returned 1:many mapping between keys and columns

go <- go[go$ONTOLOGY == "BP",] # biological processes
go2 <- select(GO.db, go$GO, c("TERM","DEFINITION"), "GOID")

'select()' returned many:1 mapping between keys and columns

go2$TERM

 [1] "negative regulation of transcription by RNA polymerase II"    
 [2] "negative regulation of transcription by RNA polymerase II"    
 [3] "regulation of gluconeogenesis"                                
 [4] "chromatin organization"                                       
 [5] "regulation of DNA-templated transcription"                    
 [6] "regulation of transcription by RNA polymerase II"             
 [7] "apoptotic process"                                            
 [8] "chromosome segregation"                                       
 [9] "signal transduction"                                          
[10] "glucocorticoid metabolic process"                             
[11] "gene expression"                                              
[12] "microglia differentiation"                                    
[13] "adrenal gland development"                                    
[14] "intracellular steroid hormone receptor signaling pathway"     
[15] "regulation of glucocorticoid biosynthetic process"            
[16] "synaptic transmission, glutamatergic"                         
[17] "maternal behavior"                                            
[18] "glucocorticoid receptor signaling pathway"                    
[19] "glucocorticoid mediated signaling pathway"                    
[20] "positive regulation of neuron apoptotic process"              
[21] "negative regulation of DNA-templated transcription"           
[22] "positive regulation of transcription by RNA polymerase II"    
[23] "positive regulation of transcription by RNA polymerase II"    
[24] "astrocyte differentiation"                                    
[25] "cell division"                                                
[26] "nucleus localization"                                         
[27] "mammary gland duct morphogenesis"                             
[28] "motor behavior"                                               
[29] "cellular response to steroid hormone stimulus"                
[30] "cellular response to glucocorticoid stimulus"                 
[31] "cellular response to dexamethasone stimulus"                  
[32] "cellular response to transforming growth factor beta stimulus"
[33] "neuroinflammatory response"                                   
[34] "positive regulation of miRNA transcription"

Annotation-based consequence of gene isoforms

As we showed in the GRangesList lecture note how genes can have different isoforms (or “transcripts”), there are also annotation resources for understanding the differences between these isoforms. A package IsoformSwitchAnalyzeR provides a pipeline for analyzing the consequence of changes between isoforms.

In particular see the section of the vignette, Predicting switch consequences.

This package provides a function called switchPlot, that shows:

The isoform structures along with the concatenated annotations (including transcript classification, ORF, Coding Potential, NMD sensitivity, annotated protein domains as well as annotated signal peptides)

Gene and isoform expression and any annotated differential expression analysis of these

Isoform usage including the result of the isoform switch test

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] GO.db_3.19.1         AnnotationDbi_1.66.0 Biobase_2.64.0       rtracklayer_1.64.0  
 [5] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1  IRanges_2.38.1       S4Vectors_0.42.1    
 [9] AnnotationHub_3.12.0 BiocFileCache_2.12.0 dbplyr_2.5.0         BiocGenerics_0.50.0 
[13] testthat_3.2.1.1     rmarkdown_2.27       devtools_2.4.5       usethis_3.0.0       

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                   bitops_1.0-8                remotes_2.5.0              
 [4] rlang_1.1.4                 magrittr_2.0.3              matrixStats_1.3.0          
 [7] compiler_4.4.1              RSQLite_2.3.7               png_0.1-8                  
[10] vctrs_0.6.5                 stringr_1.5.1               profvis_0.3.8              
[13] pkgconfig_2.0.3             crayon_1.5.3                fastmap_1.2.0              
[16] XVector_0.44.0              ellipsis_0.3.2              utf8_1.2.4                 
[19] Rsamtools_2.20.0            promises_1.3.0              sessioninfo_1.2.2          
[22] UCSC.utils_1.0.0            purrr_1.0.2                 bit_4.0.5                  
[25] xfun_0.46                   zlibbioc_1.50.0             cachem_1.1.0               
[28] jsonlite_1.8.8              blob_1.2.4                  later_1.3.2                
[31] DelayedArray_0.30.1         BiocParallel_1.38.0         parallel_4.4.1             
[34] R6_2.5.1                    stringi_1.8.4               pkgload_1.4.0              
[37] brio_1.1.5                  Rcpp_1.0.13                 SummarizedExperiment_1.34.0
[40] knitr_1.48                  httpuv_1.6.15               Matrix_1.7-0               
[43] tidyselect_1.2.1            rstudioapi_0.16.0           abind_1.4-5                
[46] yaml_2.3.10                 codetools_0.2-20            miniUI_0.1.1.1             
[49] curl_5.2.1                  pkgbuild_1.4.4              lattice_0.22-6             
[52] tibble_3.2.1                shiny_1.9.1                 withr_3.0.1                
[55] KEGGREST_1.44.1             evaluate_0.24.0             urlchecker_1.0.1           
[58] Biostrings_2.72.1           pillar_1.9.0                BiocManager_1.30.23        
[61] filelock_1.0.3              MatrixGenerics_1.16.0       generics_0.1.3             
[64] RCurl_1.98-1.16             BiocVersion_3.19.1          xtable_1.8-4               
[67] glue_1.7.0                  tools_4.4.1                 BiocIO_1.14.0              
[70] BSgenome_1.72.0             GenomicAlignments_1.40.0    fs_1.6.4                   
[73] XML_3.99-0.17               grid_4.4.1                  GenomeInfoDbData_1.2.12    
[76] restfulr_0.0.15             cli_3.6.3                   rappdirs_0.3.3             
[79] fansi_1.0.6                 S4Arrays_1.4.1              dplyr_1.1.4                
[82] digest_0.6.36               SparseArray_1.4.8           rjson_0.2.21               
[85] htmlwidgets_1.6.4           memoise_2.0.1               htmltools_0.5.8.1          
[88] lifecycle_1.0.4             httr_1.4.7                  mime_0.12                  
[91] bit64_4.0.5

Organism databases for mapping gene identifiers

Basic annotation tables with select

Mapping identifiers with mapIds

Gene ontology terms

Annotation-based consequence of gene isoforms

Basic annotation tables with `select`

Mapping identifiers with `mapIds`