In this last section, we combine what we learned about ranges and annotation in order to obtain sequences of DNA from a genome.
First we introduce the Bioconductor package that we use to manipulate DNA strings (also RNA strings or amino acid strings). First, I load the package, and create a short character string with some letters of DNA:
library(Biostrings)
dna <- "AAACGCG"
class(dna)
## [1] "character"
A simple character string in R doesn’t help us much for working with
DNA. We can subset
out parts of it using base R, but
there’s a not of functionality missing. Note what we can do once we turn
it into a DNAString object. The reverse complement of a string
is the letters you would see on the antisense strand of DNA, going from
right to left. This is a useful function; for example, if we have a gene
that is transcribed on the - strand, the RNA that is produced will be
the reverse complement of the DNA on the + strand (this is the DNA that
you see in the reference genome, which we will load later).
dna <- DNAString("AAACGCG")
reverseComplement(dna)
## 7-letter DNAString object
## seq: CGCGTTT
A DNAStringSet is a vector of DNAStrings. If we want to know
how many strings, we use length
, whereas if we want to know
how many letters per DNAString, we use width
:
dna <- DNAStringSet(c("AAAAA","CGCG","TCG"))
dna
## DNAStringSet object of length 3:
## width seq
## [1] 5 AAAAA
## [2] 4 CGCG
## [3] 3 TCG
length(dna)
## [1] 3
width(dna)
## [1] 5 4 3
There are a number of functions in the Biostrings package to help quickly tabulate things like letter frequency in the strings:
letterFrequency(dna, "C")
## C
## [1,] 0
## [2,] 2
## [3,] 1
letterFrequency(dna, "C", as.prob=TRUE)
## C
## [1,] 0.0000000
## [2,] 0.5000000
## [3,] 0.3333333
letterFrequency(dna, "CG", as.prob=TRUE)
## C|G
## [1,] 0.0000000
## [2,] 1.0000000
## [3,] 0.6666667
We can also look for matches of a query in a set of strings, using
the function vmatchPattern
. The elementNROWS
function lets us know how many matches were found per string in
dna
. And then we can see where the matches begin and end by
looking into individual elements. Note that the matches are described as
IRanges
.
dna <- DNAStringSet(c("AACTCTC","CTCTAAA","AAAGAG"))
matches <- vmatchPattern("CTC", dna)
elementNROWS(matches)
## [1] 2 1 0
matches[[1]]
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 3 5 3
## [2] 5 7 3
The opposite setup: we can look for a set of queries within a longer
string, using matchPDict
. These function may sound like
they have funny names but the names often have to do with the data
structures or algorithms used to quickly compute the matches.
dna <- DNAString("AAACTCAAAGAGAAATTTAAA")
pd <- PDict(c("CTC","GAG","TTT","AAA"))
matches <- matchPDict(pd, dna)
elementNROWS(matches)
## [1] 1 1 1 4
matches[[4]]
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 3 3
## [2] 7 9 3
## [3] 13 15 3
## [4] 19 21 3
These functions from Biostrings are fast when run in vectorized fashion (providing a vector of strings to match). Note that if you had millions of strings to match against a reference (e.g. short read RNA-seq), there are dedicated command line tools to perform such mapping/alignment (and which may take into account introns when aligning RNA reads to the genome). We will discuss these tools in a later lecture. There are also Bioconductor packages which can perform alignment, including Rsubread and gmapR.
Many genomes exist as packages on Bioconductor, for example:
We will work with the UCSC version of the hg38/GRCh38 genome, the
second package above. This package is ~670 Mb, so you may want to keep
track of where it is kept on your computer. You can look up the library
path using .libPaths()
, and you can remove packages with
remove.packages()
.
But first, I want to show you also how to work with genomes that don’t exist as a package on Bioconductor. The first step would be to load up the AnnotationHub package:
library(AnnotationHub)
## Loading required package: BiocFileCache
## Loading required package: dbplyr
ah <- AnnotationHub()
## snapshotDate(): 2022-09-09
As before, you can look through a web browser to find a particular genome. For example, try searching for species: Homo sapiens, genome: grch38, description: FASTA. Notice the different descriptions. Note that “cDNA” stands for the sequence that is transcribed into RNA (so the exon sequence of transcripts).
display(ah)
We can also query the AnnotationHub using the query
function. Suppose we wanted the GRCh38 human genome (though we have it
above in a package). We look for the files labeled
dna.primary_assembly
:
res <- query(ah, c("ensembl","GRCh38","dna.primary_assembly"))
head(res$sourceurl)
## [1] "ftp://ftp.ensembl.org/pub/release-82/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [2] "ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [3] "ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [4] "ftp://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [5] "ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
## [6] "ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
We could then download one of these files using double square
brackets, to pick which of the genomes in res
. We won’t do
that. Instead, we will work with the package described above, which can
be downloaded with Bioconductor’s installation function.
suppressPackageStartupMessages(library(BSgenome.Hsapiens.NCBI.GRCh38))
Hsapiens
## Human genome:
## # organism: Homo sapiens (Human)
## # genome: GRCh38
## # provider: NCBI
## # release date: 2013-12-17
## # 455 sequences:
## # 1 2
## # 3 4
## # 5 6
## # 7 8
## # 9 10
## # ... ...
## # HSCHR19KIR_FH05_B_HAP_CTG3_1 HSCHR19KIR_FH06_A_HAP_CTG3_1
## # HSCHR19KIR_FH06_BA1_HAP_CTG3_1 HSCHR19KIR_FH08_A_HAP_CTG3_1
## # HSCHR19KIR_FH08_BAX_HAP_CTG3_1 HSCHR19KIR_FH13_A_HAP_CTG3_1
## # HSCHR19KIR_FH13_BA2_HAP_CTG3_1 HSCHR19KIR_FH15_A_HAP_CTG3_1
## # HSCHR19KIR_RP5_B_HAP_CTG3_1
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given
## # sequence)
We can extract particular sequences from a reference genome using
GRanges. Here we choose the NCBI
version of the
genome. Note that the difference between the “UCSC” and “NCBI/Ensembl”
style of chromosomes is chr1
vs 1
.
seqinfo(Hsapiens)
## Seqinfo object with 455 sequences (1 circular) from GRCh38 genome:
## seqnames seqlengths isCircular genome
## 1 248956422 FALSE GRCh38
## 2 242193529 FALSE GRCh38
## 3 198295559 FALSE GRCh38
## 4 190214555 FALSE GRCh38
## 5 181538259 FALSE GRCh38
## ... ... ... ...
## HSCHR19KIR_FH08_BAX_HAP_CTG3_1 200773 FALSE GRCh38
## HSCHR19KIR_FH13_A_HAP_CTG3_1 170148 FALSE GRCh38
## HSCHR19KIR_FH13_BA2_HAP_CTG3_1 215732 FALSE GRCh38
## HSCHR19KIR_FH15_A_HAP_CTG3_1 170537 FALSE GRCh38
## HSCHR19KIR_RP5_B_HAP_CTG3_1 177381 FALSE GRCh38
gr <- GRanges("1", IRanges(1e6 + c(1,101,201), width=100))
gr
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 1000001-1000100 *
## [2] 1 1000101-1000200 *
## [3] 1 1000201-1000300 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
dna <- getSeq(Hsapiens, gr)
dna
## DNAStringSet object of length 3:
## width seq
## [1] 100 GTGGAGCGCGCCGCCACGGACCACGGGCGGGCTGGCGGGCGAGC...CGATCCGAGCCCCTAGGGCGGATCCCGGCTCCAGGCCCGCGCG
## [2] 100 CGCCTCAGGCCGTTTCCCTATTTAAGGCCTCGCCGCCGCGGGGT...TCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGC
## [3] 100 GCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGCTCCGGGCGC...GACAATGGCATGCGGGGCCCCCGCCCGCGCCCCCCTCCGACAC
We can download the cDNA (“complementary DNA”), which is the sequence
of the exons for each transcript. Note that, if the gene is on the -
strand, then the cDNA will be the reverse complement of the reference
genome, the Hsapiens
object above.
txome <- query(ah, c("ensembl","GRCh38","cdna.all","release-86"))[[1]]
## loading from cache
The seqnames
in this case are not the chromosomes, they
are the names of the transcripts, and there are 178,000 of them. Note
that the transcript names have a .
and an extra number on
the end. This is a version number (if you noticed, we saw these version
numbers previously on the rowRanges
of the human airway
epithelial data in the SummarizedExperiment
lecture note).
We will have to chop this extra number off later in order to match with a set of transcript identifiers that do not have the version number.
seqinfo(txome)
## Seqinfo object with 178136 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## ENST00000448914.1 13 <NA> <NA>
## ENST00000631435.1 12 <NA> <NA>
## ENST00000632684.1 12 <NA> <NA>
## ENST00000434970.2 9 <NA> <NA>
## ENST00000415118.1 8 <NA> <NA>
## ... ... ... ...
## ENST00000454398.1 229 <NA> <NA>
## ENST00000457676.1 421 <NA> <NA>
## ENST00000457750.1 1680 <NA> <NA>
## ENST00000520423.2 412 <NA> <NA>
## ENST00000523715.2 472 <NA> <NA>
txs <- seqnames(seqinfo(txome))
head(txs)
## [1] "ENST00000448914.1" "ENST00000631435.1" "ENST00000632684.1" "ENST00000434970.2"
## [5] "ENST00000415118.1" "ENST00000633010.1"
In order to extract a transcript’s sequence, we use the transcript name and we need to know the length of the transcript. We pick a transcript at random.
si <- seqinfo(txome)
idx <- 5000
txp_name <- seqnames(si)[idx]
txp_name
## [1] "ENST00000560761.5"
txp_len <- seqlengths(si)[idx]
txp_len
## ENST00000560761.5
## 657
txp_seq <- getSeq(txome, GRanges(seqnames(si)[idx], IRanges(1, txp_len)))
txp_seq
## DNAStringSet object of length 1:
## width seq
## [1] 657 AGATATCCTTAGCTAGTGTAACAATGTTGAAAGTGATGTGAGAA...GGGCAAGTCCACCATCACGATCGTGGCCGAGGAGGTATCAGGC
The above method is fairly easy for getting a transcript sequence if
we know the cDNA file exists for the same genome and gene version.
Another way, if we only have a GRangesList for the transcripts
(not genes) is to use the extractTranscriptSeqs
function
from the GenomicFeatures package. This function takes two arguments: a
BSgenome object and the GRangesList of the
exons-by-transcript.
suppressPackageStartupMessages(library(EnsDb.Hsapiens.v86))
edb <- EnsDb.Hsapiens.v86
The following line of code can take some time, so I save the file for future use.
if (file.exists("ebt.rda")) {
load("ebt.rda")
} else {
ebt <- exonsBy(edb, by="tx") # exons-by-transcript
save(ebt, file="ebt.rda")
}
txp_name_no_version <- sub("\\..*","",txp_name) # cut version number
txp <- ebt[txp_name_no_version]
library(GenomicFeatures)
# another option here would be ensembldb:::getGenomeTwoBitFile()
txp_seq_extracted <- extractTranscriptSeqs(Hsapiens, txp)
txp_seq_extracted
## DNAStringSet object of length 1:
## width seq names
## [1] 657 AGATATCCTTAGCTAGTGTAACAATGTTGAAAG...ACCATCACGATCGTGGCCGAGGAGGTATCAGGC ENST00000560761
txp_seq == txp_seq_extracted
## [1] TRUE
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.21.2
## [3] AnnotationFilter_1.21.0 GenomicFeatures_1.49.5
## [5] AnnotationDbi_1.59.1 Biobase_2.57.1
## [7] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 BSgenome_1.65.2
## [9] rtracklayer_1.57.0 GenomicRanges_1.49.0
## [11] AnnotationHub_3.5.0 BiocFileCache_2.5.0
## [13] dbplyr_2.2.1 Biostrings_2.65.1
## [15] GenomeInfoDb_1.33.3 XVector_0.37.0
## [17] IRanges_2.31.0 S4Vectors_0.35.1
## [19] BiocGenerics_0.43.0 pkgdown_2.0.5
## [21] testthat_3.1.4 rmarkdown_2.14
## [23] devtools_2.4.3 usethis_2.1.6
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.29.0 matrixStats_0.62.0 bitops_1.0-7
## [4] fs_1.5.2 bit64_4.0.5 progress_1.2.2
## [7] filelock_1.0.2 httr_1.4.3 tools_4.2.1
## [10] bslib_0.3.1 utf8_1.2.2 R6_2.5.1
## [13] lazyeval_0.2.2 DBI_1.1.3 tidyselect_1.1.2
## [16] prettyunits_1.1.1 processx_3.6.1 bit_4.0.4
## [19] curl_4.3.2 compiler_4.2.1 cli_3.3.0
## [22] xml2_1.3.3 DelayedArray_0.23.0 sass_0.4.1
## [25] callr_3.7.0 rappdirs_0.3.3 stringr_1.4.0
## [28] digest_0.6.29 Rsamtools_2.13.3 pkgconfig_2.0.3
## [31] htmltools_0.5.2 sessioninfo_1.2.2 MatrixGenerics_1.9.1
## [34] fastmap_1.1.0 rlang_1.0.3 rstudioapi_0.13
## [37] RSQLite_2.2.14 shiny_1.7.1 jquerylib_0.1.4
## [40] BiocIO_1.7.1 generics_0.1.3 jsonlite_1.8.0
## [43] BiocParallel_1.31.10 dplyr_1.0.9 RCurl_1.98-1.7
## [46] magrittr_2.0.3 GenomeInfoDbData_1.2.8 Matrix_1.4-1
## [49] Rcpp_1.0.8.3 fansi_1.0.3 lifecycle_1.0.1
## [52] stringi_1.7.6 yaml_2.3.5 SummarizedExperiment_1.27.1
## [55] zlibbioc_1.43.0 brio_1.1.3 pkgbuild_1.3.1
## [58] grid_4.2.1 blob_1.2.3 parallel_4.2.1
## [61] promises_1.2.0.1 crayon_1.5.1 lattice_0.20-45
## [64] hms_1.1.1 KEGGREST_1.37.2 knitr_1.39
## [67] ps_1.7.1 pillar_1.7.0 rjson_0.2.21
## [70] biomaRt_2.53.2 codetools_0.2-18 pkgload_1.3.0
## [73] XML_3.99-0.10 glue_1.6.2 BiocVersion_3.16.0
## [76] evaluate_0.15 remotes_2.4.2 BiocManager_1.30.18
## [79] vctrs_0.4.1 png_0.1-7 httpuv_1.6.5
## [82] purrr_0.3.4 assertthat_0.2.1 cachem_1.0.6
## [85] xfun_0.31 mime_0.12 xtable_1.8-4
## [88] restfulr_0.0.15 later_1.3.0 tibble_3.1.7
## [91] GenomicAlignments_1.33.0 memoise_2.0.1 ellipsis_0.3.2
## [94] interactiveDisplayBase_1.35.0