In this last section, we combine what we learned about ranges and annotation in order to obtain sequences of DNA from a genome.

First we introduce the Bioconductor package that we use to manipulate DNA strings (also RNA strings or amino acid strings). First, I load the package, and create a short character string with some letters of DNA:

library(Biostrings)
dna <- "AAACGCG"
class(dna)
## [1] "character"

A simple character string in R doesn’t help us much for working with DNA. We can subset out parts of it using base R, but there’s a not of functionality missing. Note what we can do once we turn it into a DNAString object. The reverse complement of a string is the letters you would see on the antisense strand of DNA, going from right to left. This is a useful function; for example, if we have a gene that is transcribed on the - strand, the RNA that is produced will be the reverse complement of the DNA on the + strand (this is the DNA that you see in the reference genome, which we will load later).

dna <- DNAString("AAACGCG")
reverseComplement(dna)
## 7-letter DNAString object
## seq: CGCGTTT

A DNAStringSet is a vector of DNAStrings. If we want to know how many strings, we use length, whereas if we want to know how many letters per DNAString, we use width:

dna <- DNAStringSet(c("AAAAA","CGCG","TCG"))
dna
## DNAStringSet object of length 3:
##     width seq
## [1]     5 AAAAA
## [2]     4 CGCG
## [3]     3 TCG
length(dna)
## [1] 3
width(dna)
## [1] 5 4 3

There are a number of functions in the Biostrings package to help quickly tabulate things like letter frequency in the strings:

letterFrequency(dna, "C")
##      C
## [1,] 0
## [2,] 2
## [3,] 1
letterFrequency(dna, "C", as.prob=TRUE)
##              C
## [1,] 0.0000000
## [2,] 0.5000000
## [3,] 0.3333333
letterFrequency(dna, "CG", as.prob=TRUE)
##            C|G
## [1,] 0.0000000
## [2,] 1.0000000
## [3,] 0.6666667

We can also look for matches of a query in a set of strings, using the function vmatchPattern. The elementNROWS function lets us know how many matches were found per string in dna. And then we can see where the matches begin and end by looking into individual elements. Note that the matches are described as IRanges.

dna <- DNAStringSet(c("AACTCTC","CTCTAAA","AAAGAG"))
matches <- vmatchPattern("CTC", dna)
elementNROWS(matches)
## [1] 2 1 0
matches[[1]]
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         3         5         3
##   [2]         5         7         3

The opposite setup: we can look for a set of queries within a longer string, using matchPDict. These function may sound like they have funny names but the names often have to do with the data structures or algorithms used to quickly compute the matches.

dna <- DNAString("AAACTCAAAGAGAAATTTAAA")
pd <- PDict(c("CTC","GAG","TTT","AAA"))
matches <- matchPDict(pd, dna)
elementNROWS(matches)
## [1] 1 1 1 4
matches[[4]]
## IRanges object with 4 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         3         3
##   [2]         7         9         3
##   [3]        13        15         3
##   [4]        19        21         3

These functions from Biostrings are fast when run in vectorized fashion (providing a vector of strings to match). Note that if you had millions of strings to match against a reference (e.g. short read RNA-seq), there are dedicated command line tools to perform such mapping/alignment (and which may take into account introns when aligning RNA reads to the genome). We will discuss these tools in a later lecture. There are also Bioconductor packages which can perform alignment, including Rsubread and gmapR.

Working with genomes

Many genomes exist as packages on Bioconductor, for example:

We will work with the UCSC version of the hg38/GRCh38 genome, the second package above. This package is ~670 Mb, so you may want to keep track of where it is kept on your computer. You can look up the library path using .libPaths(), and you can remove packages with remove.packages().

But first, I want to show you also how to work with genomes that don’t exist as a package on Bioconductor. The first step would be to load up the AnnotationHub package:

library(AnnotationHub)
## Loading required package: BiocFileCache
## Loading required package: dbplyr
ah <- AnnotationHub()
## snapshotDate(): 2022-09-09

As before, you can look through a web browser to find a particular genome. For example, try searching for species: Homo sapiens, genome: grch38, description: FASTA. Notice the different descriptions. Note that “cDNA” stands for the sequence that is transcribed into RNA (so the exon sequence of transcripts).

display(ah)

We can also query the AnnotationHub using the query function. Suppose we wanted the GRCh38 human genome (though we have it above in a package). We look for the files labeled dna.primary_assembly:

res <- query(ah, c("ensembl","GRCh38","dna.primary_assembly"))
head(res$sourceurl)
## [1] "ftp://ftp.ensembl.org/pub/release-82/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [2] "ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [3] "ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [4] "ftp://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [5] "ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [6] "ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"

We could then download one of these files using double square brackets, to pick which of the genomes in res. We won’t do that. Instead, we will work with the package described above, which can be downloaded with Bioconductor’s installation function.

suppressPackageStartupMessages(library(BSgenome.Hsapiens.NCBI.GRCh38))
Hsapiens
## Human genome:
## # organism: Homo sapiens (Human)
## # genome: GRCh38
## # provider: NCBI
## # release date: 2013-12-17
## # 455 sequences:
## #   1                                  2                                 
## #   3                                  4                                 
## #   5                                  6                                 
## #   7                                  8                                 
## #   9                                  10                                
## #   ...                                ...                               
## #   HSCHR19KIR_FH05_B_HAP_CTG3_1       HSCHR19KIR_FH06_A_HAP_CTG3_1      
## #   HSCHR19KIR_FH06_BA1_HAP_CTG3_1     HSCHR19KIR_FH08_A_HAP_CTG3_1      
## #   HSCHR19KIR_FH08_BAX_HAP_CTG3_1     HSCHR19KIR_FH13_A_HAP_CTG3_1      
## #   HSCHR19KIR_FH13_BA2_HAP_CTG3_1     HSCHR19KIR_FH15_A_HAP_CTG3_1      
## #   HSCHR19KIR_RP5_B_HAP_CTG3_1                                          
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given
## # sequence)

We can extract particular sequences from a reference genome using GRanges. Here we choose the NCBI version of the genome. Note that the difference between the “UCSC” and “NCBI/Ensembl” style of chromosomes is chr1 vs 1.

seqinfo(Hsapiens)
## Seqinfo object with 455 sequences (1 circular) from GRCh38 genome:
##   seqnames                       seqlengths isCircular genome
##   1                               248956422      FALSE GRCh38
##   2                               242193529      FALSE GRCh38
##   3                               198295559      FALSE GRCh38
##   4                               190214555      FALSE GRCh38
##   5                               181538259      FALSE GRCh38
##   ...                                   ...        ...    ...
##   HSCHR19KIR_FH08_BAX_HAP_CTG3_1     200773      FALSE GRCh38
##   HSCHR19KIR_FH13_A_HAP_CTG3_1       170148      FALSE GRCh38
##   HSCHR19KIR_FH13_BA2_HAP_CTG3_1     215732      FALSE GRCh38
##   HSCHR19KIR_FH15_A_HAP_CTG3_1       170537      FALSE GRCh38
##   HSCHR19KIR_RP5_B_HAP_CTG3_1        177381      FALSE GRCh38
gr <- GRanges("1", IRanges(1e6 + c(1,101,201), width=100))
gr
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames          ranges strand
##          <Rle>       <IRanges>  <Rle>
##   [1]        1 1000001-1000100      *
##   [2]        1 1000101-1000200      *
##   [3]        1 1000201-1000300      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
dna <- getSeq(Hsapiens, gr)
dna
## DNAStringSet object of length 3:
##     width seq
## [1]   100 GTGGAGCGCGCCGCCACGGACCACGGGCGGGCTGGCGGGCGAGC...CGATCCGAGCCCCTAGGGCGGATCCCGGCTCCAGGCCCGCGCG
## [2]   100 CGCCTCAGGCCGTTTCCCTATTTAAGGCCTCGCCGCCGCGGGGT...TCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGC
## [3]   100 GCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGCTCCGGGCGC...GACAATGGCATGCGGGGCCCCCGCCCGCGCCCCCCTCCGACAC

Working with a transcriptome

We can download the cDNA (“complementary DNA”), which is the sequence of the exons for each transcript. Note that, if the gene is on the - strand, then the cDNA will be the reverse complement of the reference genome, the Hsapiens object above.

txome <- query(ah, c("ensembl","GRCh38","cdna.all","release-86"))[[1]]
## loading from cache

The seqnames in this case are not the chromosomes, they are the names of the transcripts, and there are 178,000 of them. Note that the transcript names have a . and an extra number on the end. This is a version number (if you noticed, we saw these version numbers previously on the rowRanges of the human airway epithelial data in the SummarizedExperiment lecture note).

We will have to chop this extra number off later in order to match with a set of transcript identifiers that do not have the version number.

seqinfo(txome)
## Seqinfo object with 178136 sequences from an unspecified genome:
##   seqnames          seqlengths isCircular genome
##   ENST00000448914.1         13       <NA>   <NA>
##   ENST00000631435.1         12       <NA>   <NA>
##   ENST00000632684.1         12       <NA>   <NA>
##   ENST00000434970.2          9       <NA>   <NA>
##   ENST00000415118.1          8       <NA>   <NA>
##   ...                      ...        ...    ...
##   ENST00000454398.1        229       <NA>   <NA>
##   ENST00000457676.1        421       <NA>   <NA>
##   ENST00000457750.1       1680       <NA>   <NA>
##   ENST00000520423.2        412       <NA>   <NA>
##   ENST00000523715.2        472       <NA>   <NA>
txs <- seqnames(seqinfo(txome))
head(txs)
## [1] "ENST00000448914.1" "ENST00000631435.1" "ENST00000632684.1" "ENST00000434970.2"
## [5] "ENST00000415118.1" "ENST00000633010.1"

In order to extract a transcript’s sequence, we use the transcript name and we need to know the length of the transcript. We pick a transcript at random.

si <- seqinfo(txome)
idx <- 5000
txp_name <- seqnames(si)[idx]
txp_name
## [1] "ENST00000560761.5"
txp_len <- seqlengths(si)[idx]
txp_len
## ENST00000560761.5 
##               657
txp_seq <- getSeq(txome, GRanges(seqnames(si)[idx], IRanges(1, txp_len)))
txp_seq
## DNAStringSet object of length 1:
##     width seq
## [1]   657 AGATATCCTTAGCTAGTGTAACAATGTTGAAAGTGATGTGAGAA...GGGCAAGTCCACCATCACGATCGTGGCCGAGGAGGTATCAGGC

The above method is fairly easy for getting a transcript sequence if we know the cDNA file exists for the same genome and gene version. Another way, if we only have a GRangesList for the transcripts (not genes) is to use the extractTranscriptSeqs function from the GenomicFeatures package. This function takes two arguments: a BSgenome object and the GRangesList of the exons-by-transcript.

suppressPackageStartupMessages(library(EnsDb.Hsapiens.v86))
edb <- EnsDb.Hsapiens.v86

The following line of code can take some time, so I save the file for future use.

if (file.exists("ebt.rda")) {
  load("ebt.rda")
} else {
  ebt <- exonsBy(edb, by="tx") # exons-by-transcript
  save(ebt, file="ebt.rda")
}
txp_name_no_version <- sub("\\..*","",txp_name) # cut version number
txp <- ebt[txp_name_no_version]
library(GenomicFeatures)
# another option here would be ensembldb:::getGenomeTwoBitFile()
txp_seq_extracted <- extractTranscriptSeqs(Hsapiens, txp)
txp_seq_extracted
## DNAStringSet object of length 1:
##     width seq                                                                   names               
## [1]   657 AGATATCCTTAGCTAGTGTAACAATGTTGAAAG...ACCATCACGATCGTGGCCGAGGAGGTATCAGGC ENST00000560761
txp_seq == txp_seq_extracted
## [1] TRUE
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] EnsDb.Hsapiens.v86_2.99.0              ensembldb_2.21.2                      
##  [3] AnnotationFilter_1.21.0                GenomicFeatures_1.49.5                
##  [5] AnnotationDbi_1.59.1                   Biobase_2.57.1                        
##  [7] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 BSgenome_1.65.2                       
##  [9] rtracklayer_1.57.0                     GenomicRanges_1.49.0                  
## [11] AnnotationHub_3.5.0                    BiocFileCache_2.5.0                   
## [13] dbplyr_2.2.1                           Biostrings_2.65.1                     
## [15] GenomeInfoDb_1.33.3                    XVector_0.37.0                        
## [17] IRanges_2.31.0                         S4Vectors_0.35.1                      
## [19] BiocGenerics_0.43.0                    pkgdown_2.0.5                         
## [21] testthat_3.1.4                         rmarkdown_2.14                        
## [23] devtools_2.4.3                         usethis_2.1.6                         
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.29.0           matrixStats_0.62.0            bitops_1.0-7                 
##  [4] fs_1.5.2                      bit64_4.0.5                   progress_1.2.2               
##  [7] filelock_1.0.2                httr_1.4.3                    tools_4.2.1                  
## [10] bslib_0.3.1                   utf8_1.2.2                    R6_2.5.1                     
## [13] lazyeval_0.2.2                DBI_1.1.3                     tidyselect_1.1.2             
## [16] prettyunits_1.1.1             processx_3.6.1                bit_4.0.4                    
## [19] curl_4.3.2                    compiler_4.2.1                cli_3.3.0                    
## [22] xml2_1.3.3                    DelayedArray_0.23.0           sass_0.4.1                   
## [25] callr_3.7.0                   rappdirs_0.3.3                stringr_1.4.0                
## [28] digest_0.6.29                 Rsamtools_2.13.3              pkgconfig_2.0.3              
## [31] htmltools_0.5.2               sessioninfo_1.2.2             MatrixGenerics_1.9.1         
## [34] fastmap_1.1.0                 rlang_1.0.3                   rstudioapi_0.13              
## [37] RSQLite_2.2.14                shiny_1.7.1                   jquerylib_0.1.4              
## [40] BiocIO_1.7.1                  generics_0.1.3                jsonlite_1.8.0               
## [43] BiocParallel_1.31.10          dplyr_1.0.9                   RCurl_1.98-1.7               
## [46] magrittr_2.0.3                GenomeInfoDbData_1.2.8        Matrix_1.4-1                 
## [49] Rcpp_1.0.8.3                  fansi_1.0.3                   lifecycle_1.0.1              
## [52] stringi_1.7.6                 yaml_2.3.5                    SummarizedExperiment_1.27.1  
## [55] zlibbioc_1.43.0               brio_1.1.3                    pkgbuild_1.3.1               
## [58] grid_4.2.1                    blob_1.2.3                    parallel_4.2.1               
## [61] promises_1.2.0.1              crayon_1.5.1                  lattice_0.20-45              
## [64] hms_1.1.1                     KEGGREST_1.37.2               knitr_1.39                   
## [67] ps_1.7.1                      pillar_1.7.0                  rjson_0.2.21                 
## [70] biomaRt_2.53.2                codetools_0.2-18              pkgload_1.3.0                
## [73] XML_3.99-0.10                 glue_1.6.2                    BiocVersion_3.16.0           
## [76] evaluate_0.15                 remotes_2.4.2                 BiocManager_1.30.18          
## [79] vctrs_0.4.1                   png_0.1-7                     httpuv_1.6.5                 
## [82] purrr_0.3.4                   assertthat_0.2.1              cachem_1.0.6                 
## [85] xfun_0.31                     mime_0.12                     xtable_1.8-4                 
## [88] restfulr_0.0.15               later_1.3.0                   tibble_3.1.7                 
## [91] GenomicAlignments_1.33.0      memoise_2.0.1                 ellipsis_0.3.2               
## [94] interactiveDisplayBase_1.35.0