library(Biostrings)
<- "AAACGCG"
dna class(dna)
[1] "character"
In this last section, we combine what we learned about ranges and annotation in order to obtain sequences of DNA from a genome.
First we introduce the Bioconductor package that we use to manipulate DNA strings (also RNA strings or amino acid strings). First, I load the package, and create a short character string with some letters of DNA:
library(Biostrings)
<- "AAACGCG"
dna class(dna)
[1] "character"
A simple character string in R doesn’t help us much for working with DNA. We can subset
out parts of it using base R, but there’s a not of functionality missing. Note what we can do once we turn it into a DNAString object. The reverse complement of a string is the letters you would see on the antisense strand of DNA, going from right to left. This is a useful function; for example, if we have a gene that is transcribed on the - strand, the RNA that is produced will be the reverse complement of the DNA on the + strand (this is the DNA that you see in the reference genome, which we will load later).
<- DNAString("AAACGCG")
dna reverseComplement(dna)
7-letter DNAString object
seq: CGCGTTT
A DNAStringSet is a vector of DNAStrings. If we want to know how many strings, we use length
, whereas if we want to know how many letters per DNAString, we use width
:
<- DNAStringSet(c("AAAAA","CGCG","TCG"))
dna dna
DNAStringSet object of length 3:
width seq
[1] 5 AAAAA
[2] 4 CGCG
[3] 3 TCG
length(dna)
[1] 3
width(dna)
[1] 5 4 3
There are a number of functions in the Biostrings package to help quickly tabulate things like letter frequency in the strings:
letterFrequency(dna, "C")
C
[1,] 0
[2,] 2
[3,] 1
letterFrequency(dna, "C", as.prob=TRUE)
C
[1,] 0.0000000
[2,] 0.5000000
[3,] 0.3333333
letterFrequency(dna, "CG", as.prob=TRUE)
C|G
[1,] 0.0000000
[2,] 1.0000000
[3,] 0.6666667
We can also look for matches of a query in a set of strings, using the function vmatchPattern
. The elementNROWS
function lets us know how many matches were found per string in dna
. And then we can see where the matches begin and end by looking into individual elements. Note that the matches are described as IRanges
.
<- DNAStringSet(c("AACTCTC","CTCTAAA","AAAGAG"))
dna <- vmatchPattern("CTC", dna)
matches elementNROWS(matches)
[1] 2 1 0
1]] matches[[
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 3 5 3
[2] 5 7 3
The opposite setup: we can look for a set of queries within a longer string, using matchPDict
. These function may sound like they have funny names but the names often have to do with the data structures or algorithms used to quickly compute the matches.
<- DNAString("AAACTCAAAGAGAAATTTAAA")
dna <- PDict(c("CTC","GAG","TTT","AAA"))
pd <- matchPDict(pd, dna)
matches elementNROWS(matches)
[1] 1 1 1 4
4]] matches[[
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 3 3
[2] 7 9 3
[3] 13 15 3
[4] 19 21 3
These functions from Biostrings are fast when run in vectorized fashion (providing a vector of strings to match). Note that if you had millions of strings to match against a reference (e.g. short read RNA-seq), there are dedicated command line tools to perform such mapping/alignment (and which may take into account introns when aligning RNA reads to the genome). We will discuss these tools in a later lecture. There are also Bioconductor packages which can perform alignment, including Rsubread and gmapR.
Many genomes exist as packages on Bioconductor, for example:
We will work with the UCSC version of the hg38/GRCh38 genome, the second package above. This package is ~670 Mb, so you may want to keep track of where it is kept on your computer. You can look up the library path using .libPaths()
, and you can remove packages with remove.packages()
.
But first, I want to show you also how to work with genomes that don’t exist as a package on Bioconductor. The first step would be to load up the AnnotationHub package:
library(AnnotationHub)
Loading required package: BiocFileCache
Loading required package: dbplyr
<- AnnotationHub() ah
As before, you can look through a web browser to find a particular genome. For example, try searching for species: Homo sapiens, genome: grch38, description: FASTA. Notice the different descriptions. Note that “cDNA” stands for the sequence that is transcribed into RNA (so the exon sequence of transcripts).
display(ah)
We can also query the AnnotationHub using the query
function. Suppose we wanted the GRCh38 human genome (though we have it above in a package). We look for the files labeled dna.primary_assembly
:
<- query(ah, c("ensembl","GRCh38","dna.primary_assembly"))
res head(res$sourceurl)
[1] "ftp://ftp.ensembl.org/pub/release-82/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[2] "ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[3] "ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[4] "ftp://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[5] "ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[6] "ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
We could then download one of these files using double square brackets, to pick which of the genomes in res
. We won’t do that. Instead, we will work with the package described above, which can be downloaded with Bioconductor’s installation function.
suppressPackageStartupMessages(library(BSgenome.Hsapiens.NCBI.GRCh38))
Hsapiens
| BSgenome object for Human
| - organism: Homo sapiens
| - provider: NCBI
| - genome: GRCh38
| - release date: 2013-12-17
| - 455 sequence(s):
| 1 2
| 3 4
| 5 6
| 7 8
| 9 10
| ... ...
| HSCHR19KIR_FH05_B_HAP_CTG3_1 HSCHR19KIR_FH06_A_HAP_CTG3_1
| HSCHR19KIR_FH06_BA1_HAP_CTG3_1 HSCHR19KIR_FH08_A_HAP_CTG3_1
| HSCHR19KIR_FH08_BAX_HAP_CTG3_1 HSCHR19KIR_FH13_A_HAP_CTG3_1
| HSCHR19KIR_FH13_BA2_HAP_CTG3_1 HSCHR19KIR_FH15_A_HAP_CTG3_1
| HSCHR19KIR_RP5_B_HAP_CTG3_1
|
| Tips: call 'seqnames()' on the object to get all the sequence names, call 'seqinfo()' to get the
| full sequence info, use the '$' or '[[' operator to access a given sequence, see '?BSgenome' for
| more information.
We can extract particular sequences from a reference genome using GRanges. Here we choose the NCBI
version of the genome. Note that the difference between the “UCSC” and “NCBI/Ensembl” style of chromosomes is chr1
vs 1
.
seqinfo(Hsapiens)
Seqinfo object with 455 sequences (1 circular) from GRCh38 genome:
seqnames seqlengths isCircular genome
1 248956422 FALSE GRCh38
2 242193529 FALSE GRCh38
3 198295559 FALSE GRCh38
4 190214555 FALSE GRCh38
5 181538259 FALSE GRCh38
... ... ... ...
HSCHR19KIR_FH08_BAX_HAP_CTG3_1 200773 FALSE GRCh38
HSCHR19KIR_FH13_A_HAP_CTG3_1 170148 FALSE GRCh38
HSCHR19KIR_FH13_BA2_HAP_CTG3_1 215732 FALSE GRCh38
HSCHR19KIR_FH15_A_HAP_CTG3_1 170537 FALSE GRCh38
HSCHR19KIR_RP5_B_HAP_CTG3_1 177381 FALSE GRCh38
<- GRanges("1", IRanges(1e6 + c(1,101,201), width=100), strand="-")
gr gr
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 1000001-1000100 -
[2] 1 1000101-1000200 -
[3] 1 1000201-1000300 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
<- getSeq(Hsapiens, gr)
dna dna
DNAStringSet object of length 3:
width seq
[1] 100 CGCGCGGGCCTGGAGCCGGGATCCGCCCTAGGGGCTCGGATCGC...CTCGCCCGCCAGCCCGCCCGTGGTCCGTGGCGGCGCGCTCCAC
[2] 100 GCGGCCGTCGATTGGCTGGGGCGAGTGTGGGAAAGAATGCGGAG...CCCCGCGGCGGCGAGGCCTTAAATAGGGAAACGGCCTGAGGCG
[3] 100 GTGTCGGAGGGGGGCGCGGGCGGGGGCCCCGCATGCCATTGTCG...CGCCCGGAGCCGAGGCGTGACTGACAGCGAGCGGGGGAGGAGC
We can download the cDNA (“complementary DNA”), which is the sequence of the exons for each transcript. Note that, if the gene is on the - strand, then the cDNA will be the reverse complement of the reference genome, the Hsapiens
object above.
<- query(ah, c("ensembl","GRCh38","cdna.all","release-86"))[[1]] txome
loading from cache
The seqnames
in this case are not the chromosomes, they are the names of the transcripts, and there are 178,000 of them. Note that the transcript names have a .
and an extra number on the end. This is a version number (if you noticed, we saw these version numbers previously on the rowRanges
of the human airway epithelial data in the SummarizedExperiment lecture note).
We will have to chop this extra number off later in order to match with a set of transcript identifiers that do not have the version number.
seqinfo(txome)
Seqinfo object with 178136 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
ENST00000448914.1 13 <NA> <NA>
ENST00000631435.1 12 <NA> <NA>
ENST00000632684.1 12 <NA> <NA>
ENST00000434970.2 9 <NA> <NA>
ENST00000415118.1 8 <NA> <NA>
... ... ... ...
ENST00000454398.1 229 <NA> <NA>
ENST00000457676.1 421 <NA> <NA>
ENST00000457750.1 1680 <NA> <NA>
ENST00000520423.2 412 <NA> <NA>
ENST00000523715.2 472 <NA> <NA>
<- seqnames(seqinfo(txome))
txs head(txs)
[1] "ENST00000448914.1" "ENST00000631435.1" "ENST00000632684.1" "ENST00000434970.2"
[5] "ENST00000415118.1" "ENST00000633010.1"
In order to extract a transcript’s sequence, we use the transcript name and we need to know the length of the transcript. We pick a transcript at random.
<- seqinfo(txome)
si <- 5000
idx <- seqnames(si)[idx]
txp_name txp_name
[1] "ENST00000560761.5"
<- seqlengths(si)[idx]
txp_len txp_len
ENST00000560761.5
657
<- getSeq(txome, GRanges(seqnames(si)[idx], IRanges(1, txp_len)))
txp_seq txp_seq
DNAStringSet object of length 1:
width seq
[1] 657 AGATATCCTTAGCTAGTGTAACAATGTTGAAAGTGATGTGAGAA...GGGCAAGTCCACCATCACGATCGTGGCCGAGGAGGTATCAGGC
The above method is fairly easy for getting a transcript sequence if we know the cDNA file exists for the same genome and gene version. Another way, if we only have a GRangesList for the transcripts (not genes) is to use the extractTranscriptSeqs
function from the GenomicFeatures package. This function takes two arguments: a BSgenome object and the GRangesList of the exons-by-transcript.
suppressPackageStartupMessages(library(EnsDb.Hsapiens.v86))
<- EnsDb.Hsapiens.v86 edb
The following line of code can take some time, so I save the file for future use.
library(here)
here() starts at /Users/milove/teach/compbio/compbio_src
<- here("bioc","ebt_cbt.rda")
file_path if (file.exists(file_path)) {
load(file_path)
else {
} <- exonsBy(edb, by="tx") # exons-by-transcript
ebt <- cdsBy(edb, by="tx") # coding-seq-by-transcript
cbt save(ebt, cbt, file=file_path)
}
<- sub("\\..*","",txp_name) # cut version number
txp_name_no_version <- ebt[txp_name_no_version]
txp library(GenomicFeatures)
# another option here would be ensembldb:::getGenomeTwoBitFile()
<- extractTranscriptSeqs(Hsapiens, txp)
txp_seq_extracted txp_seq_extracted
DNAStringSet object of length 1:
width seq names
[1] 657 AGATATCCTTAGCTAGTGTAACAATGTTGAAAG...ACCATCACGATCGTGGCCGAGGAGGTATCAGGC ENST00000560761
== txp_seq_extracted txp_seq
[1] TRUE
Note that we can evaluate the peptide sequence using the translate
function, however we have to be sure to start with the coding sequence and not include the UTRs (untranslated regions).
<- cbt[txp_name_no_version]
coding_loc <- extractTranscriptSeqs(Hsapiens, coding_loc)
coding_seq_extracted ::translate(coding_seq_extracted) Biostrings
AAStringSet object of length 1:
width seq names
[1] 149 MSDPEMGWVPEPPTMTLGASRVELRVSCHGLLD...IVSQTKVTKPLLLKNGKTAGKSTITIVAEEVSG ENST00000560761
How could you evalute the consequence of a SNP in a coding sequence?
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] here_1.0.1 EnsDb.Hsapiens.v86_2.99.0
[3] ensembldb_2.28.0 AnnotationFilter_1.28.0
[5] GenomicFeatures_1.56.0 AnnotationDbi_1.66.0
[7] Biobase_2.64.0 BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
[9] BSgenome_1.72.0 rtracklayer_1.64.0
[11] BiocIO_1.14.0 GenomicRanges_1.56.1
[13] AnnotationHub_3.12.0 BiocFileCache_2.12.0
[15] dbplyr_2.5.0 Biostrings_2.72.1
[17] GenomeInfoDb_1.40.1 XVector_0.44.0
[19] IRanges_2.38.1 S4Vectors_0.42.1
[21] BiocGenerics_0.50.0 testthat_3.2.1.1
[23] rmarkdown_2.27 devtools_2.4.5
[25] usethis_3.0.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 bitops_1.0-8 remotes_2.5.0
[4] rlang_1.1.4 magrittr_2.0.3 matrixStats_1.3.0
[7] compiler_4.4.1 RSQLite_2.3.7 png_0.1-8
[10] vctrs_0.6.5 ProtGenerics_1.36.0 stringr_1.5.1
[13] profvis_0.3.8 pkgconfig_2.0.3 crayon_1.5.3
[16] fastmap_1.2.0 ellipsis_0.3.2 utf8_1.2.4
[19] Rsamtools_2.20.0 promises_1.3.0 sessioninfo_1.2.2
[22] UCSC.utils_1.0.0 purrr_1.0.2 bit_4.0.5
[25] xfun_0.46 zlibbioc_1.50.0 cachem_1.1.0
[28] jsonlite_1.8.8 blob_1.2.4 later_1.3.2
[31] DelayedArray_0.30.1 BiocParallel_1.38.0 parallel_4.4.1
[34] R6_2.5.1 stringi_1.8.4 pkgload_1.4.0
[37] brio_1.1.5 Rcpp_1.0.13 SummarizedExperiment_1.34.0
[40] knitr_1.48 httpuv_1.6.15 Matrix_1.7-0
[43] tidyselect_1.2.1 rstudioapi_0.16.0 abind_1.4-5
[46] yaml_2.3.10 codetools_0.2-20 miniUI_0.1.1.1
[49] curl_5.2.1 pkgbuild_1.4.4 lattice_0.22-6
[52] tibble_3.2.1 shiny_1.9.1 withr_3.0.1
[55] KEGGREST_1.44.1 evaluate_0.24.0 urlchecker_1.0.1
[58] pillar_1.9.0 BiocManager_1.30.23 filelock_1.0.3
[61] MatrixGenerics_1.16.0 generics_0.1.3 rprojroot_2.0.4
[64] RCurl_1.98-1.16 BiocVersion_3.19.1 xtable_1.8-4
[67] glue_1.7.0 lazyeval_0.2.2 tools_4.4.1
[70] GenomicAlignments_1.40.0 fs_1.6.4 XML_3.99-0.17
[73] grid_4.4.1 GenomeInfoDbData_1.2.12 restfulr_0.0.15
[76] cli_3.6.3 rappdirs_0.3.3 fansi_1.0.6
[79] S4Arrays_1.4.1 dplyr_1.1.4 digest_0.6.36
[82] SparseArray_1.4.8 rjson_0.2.21 htmlwidgets_1.6.4
[85] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
[88] httr_1.4.7 mime_0.12 bit64_4.0.5