Working with DNA sequences using Biostrings

Author

In this last section, we combine what we learned about ranges and annotation in order to obtain sequences of DNA from a genome.

First we introduce the Bioconductor package that we use to manipulate DNA strings (also RNA strings or amino acid strings). First, I load the package, and create a short character string with some letters of DNA:

library(Biostrings)
dna <- "AAACGCG"
class(dna)

[1] "character"

A simple character string in R doesn’t help us much for working with DNA. We can subset out parts of it using base R, but there’s a not of functionality missing. Note what we can do once we turn it into a DNAString object. The reverse complement of a string is the letters you would see on the antisense strand of DNA, going from right to left. This is a useful function; for example, if we have a gene that is transcribed on the - strand, the RNA that is produced will be the reverse complement of the DNA on the + strand (this is the DNA that you see in the reference genome, which we will load later).

dna <- DNAString("AAACGCG")
reverseComplement(dna)

7-letter DNAString object
seq: CGCGTTT

A DNAStringSet is a vector of DNAStrings. If we want to know how many strings, we use length, whereas if we want to know how many letters per DNAString, we use width:

dna <- DNAStringSet(c("AAAAA","CGCG","TCG"))
dna

DNAStringSet object of length 3:
    width seq
[1]     5 AAAAA
[2]     4 CGCG
[3]     3 TCG

length(dna)

[1] 3

width(dna)

[1] 5 4 3

There are a number of functions in the Biostrings package to help quickly tabulate things like letter frequency in the strings:

letterFrequency(dna, "C")

     C
[1,] 0
[2,] 2
[3,] 1

letterFrequency(dna, "C", as.prob=TRUE)

             C
[1,] 0.0000000
[2,] 0.5000000
[3,] 0.3333333

letterFrequency(dna, "CG", as.prob=TRUE)

           C|G
[1,] 0.0000000
[2,] 1.0000000
[3,] 0.6666667

We can also look for matches of a query in a set of strings, using the function vmatchPattern. The elementNROWS function lets us know how many matches were found per string in dna. And then we can see where the matches begin and end by looking into individual elements. Note that the matches are described as IRanges.

dna <- DNAStringSet(c("AACTCTC","CTCTAAA","AAAGAG"))
matches <- vmatchPattern("CTC", dna)
elementNROWS(matches)

[1] 2 1 0

matches[[1]]

IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3         5         3
  [2]         5         7         3

The opposite setup: we can look for a set of queries within a longer string, using matchPDict. These function may sound like they have funny names but the names often have to do with the data structures or algorithms used to quickly compute the matches.

dna <- DNAString("AAACTCAAAGAGAAATTTAAA")
pd <- PDict(c("CTC","GAG","TTT","AAA"))
matches <- matchPDict(pd, dna)
elementNROWS(matches)

[1] 1 1 1 4

matches[[4]]

IRanges object with 4 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         7         9         3
  [3]        13        15         3
  [4]        19        21         3

These functions from Biostrings are fast when run in vectorized fashion (providing a vector of strings to match). Note that if you had millions of strings to match against a reference (e.g. short read RNA-seq), there are dedicated command line tools to perform such mapping/alignment (and which may take into account introns when aligning RNA reads to the genome). We will discuss these tools in a later lecture. There are also Bioconductor packages which can perform alignment, including Rsubread and gmapR.

Working with genomes

Many genomes exist as packages on Bioconductor, for example:

We will work with the UCSC version of the hg38/GRCh38 genome, the second package above. This package is ~670 Mb, so you may want to keep track of where it is kept on your computer. You can look up the library path using .libPaths(), and you can remove packages with remove.packages().

But first, I want to show you also how to work with genomes that don’t exist as a package on Bioconductor. The first step would be to load up the AnnotationHub package:

library(AnnotationHub)

Loading required package: BiocFileCache

Loading required package: dbplyr

ah <- AnnotationHub()

As before, you can look through a web browser to find a particular genome. For example, try searching for species: Homo sapiens, genome: grch38, description: FASTA. Notice the different descriptions. Note that “cDNA” stands for the sequence that is transcribed into RNA (so the exon sequence of transcripts).

display(ah)

We can also query the AnnotationHub using the query function. Suppose we wanted the GRCh38 human genome (though we have it above in a package). We look for the files labeled dna.primary_assembly:

res <- query(ah, c("ensembl","GRCh38","dna.primary_assembly"))
head(res$sourceurl)

[1] "ftp://ftp.ensembl.org/pub/release-82/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[2] "ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[3] "ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[4] "ftp://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[5] "ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
[6] "ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"

We could then download one of these files using double square brackets, to pick which of the genomes in res. We won’t do that. Instead, we will work with the package described above, which can be downloaded with Bioconductor’s installation function.

suppressPackageStartupMessages(library(BSgenome.Hsapiens.NCBI.GRCh38))
Hsapiens

| BSgenome object for Human
| - organism: Homo sapiens
| - provider: NCBI
| - genome: GRCh38
| - release date: 2013-12-17
| - 455 sequence(s):
|     1                                  2                                 
|     3                                  4                                 
|     5                                  6                                 
|     7                                  8                                 
|     9                                  10                                
|     ...                                ...                               
|     HSCHR19KIR_FH05_B_HAP_CTG3_1       HSCHR19KIR_FH06_A_HAP_CTG3_1      
|     HSCHR19KIR_FH06_BA1_HAP_CTG3_1     HSCHR19KIR_FH08_A_HAP_CTG3_1      
|     HSCHR19KIR_FH08_BAX_HAP_CTG3_1     HSCHR19KIR_FH13_A_HAP_CTG3_1      
|     HSCHR19KIR_FH13_BA2_HAP_CTG3_1     HSCHR19KIR_FH15_A_HAP_CTG3_1      
|     HSCHR19KIR_RP5_B_HAP_CTG3_1                                          
| 
| Tips: call 'seqnames()' on the object to get all the sequence names, call 'seqinfo()' to get the
| full sequence info, use the '$' or '[[' operator to access a given sequence, see '?BSgenome' for
| more information.

We can extract particular sequences from a reference genome using GRanges. Here we choose the NCBI version of the genome. Note that the difference between the “UCSC” and “NCBI/Ensembl” style of chromosomes is chr1 vs 1.

seqinfo(Hsapiens)

Seqinfo object with 455 sequences (1 circular) from GRCh38 genome:
  seqnames                       seqlengths isCircular genome
  1                               248956422      FALSE GRCh38
  2                               242193529      FALSE GRCh38
  3                               198295559      FALSE GRCh38
  4                               190214555      FALSE GRCh38
  5                               181538259      FALSE GRCh38
  ...                                   ...        ...    ...
  HSCHR19KIR_FH08_BAX_HAP_CTG3_1     200773      FALSE GRCh38
  HSCHR19KIR_FH13_A_HAP_CTG3_1       170148      FALSE GRCh38
  HSCHR19KIR_FH13_BA2_HAP_CTG3_1     215732      FALSE GRCh38
  HSCHR19KIR_FH15_A_HAP_CTG3_1       170537      FALSE GRCh38
  HSCHR19KIR_RP5_B_HAP_CTG3_1        177381      FALSE GRCh38

gr <- GRanges("1", IRanges(1e6 + c(1,101,201), width=100), strand="-")
gr

GRanges object with 3 ranges and 0 metadata columns:
      seqnames          ranges strand
         <Rle>       <IRanges>  <Rle>
  [1]        1 1000001-1000100      -
  [2]        1 1000101-1000200      -
  [3]        1 1000201-1000300      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

dna <- getSeq(Hsapiens, gr)
dna

DNAStringSet object of length 3:
    width seq
[1]   100 CGCGCGGGCCTGGAGCCGGGATCCGCCCTAGGGGCTCGGATCGC...CTCGCCCGCCAGCCCGCCCGTGGTCCGTGGCGGCGCGCTCCAC
[2]   100 GCGGCCGTCGATTGGCTGGGGCGAGTGTGGGAAAGAATGCGGAG...CCCCGCGGCGGCGAGGCCTTAAATAGGGAAACGGCCTGAGGCG
[3]   100 GTGTCGGAGGGGGGCGCGGGCGGGGGCCCCGCATGCCATTGTCG...CGCCCGGAGCCGAGGCGTGACTGACAGCGAGCGGGGGAGGAGC

Working with a transcriptome

We can download the cDNA (“complementary DNA”), which is the sequence of the exons for each transcript. Note that, if the gene is on the - strand, then the cDNA will be the reverse complement of the reference genome, the Hsapiens object above.

txome <- query(ah, c("ensembl","GRCh38","cdna.all","release-86"))[[1]]

loading from cache

The seqnames in this case are not the chromosomes, they are the names of the transcripts, and there are 178,000 of them. Note that the transcript names have a . and an extra number on the end. This is a version number (if you noticed, we saw these version numbers previously on the rowRanges of the human airway epithelial data in the SummarizedExperiment lecture note).

We will have to chop this extra number off later in order to match with a set of transcript identifiers that do not have the version number.

seqinfo(txome)

Seqinfo object with 178136 sequences from an unspecified genome:
  seqnames          seqlengths isCircular genome
  ENST00000448914.1         13       <NA>   <NA>
  ENST00000631435.1         12       <NA>   <NA>
  ENST00000632684.1         12       <NA>   <NA>
  ENST00000434970.2          9       <NA>   <NA>
  ENST00000415118.1          8       <NA>   <NA>
  ...                      ...        ...    ...
  ENST00000454398.1        229       <NA>   <NA>
  ENST00000457676.1        421       <NA>   <NA>
  ENST00000457750.1       1680       <NA>   <NA>
  ENST00000520423.2        412       <NA>   <NA>
  ENST00000523715.2        472       <NA>   <NA>

txs <- seqnames(seqinfo(txome))
head(txs)

[1] "ENST00000448914.1" "ENST00000631435.1" "ENST00000632684.1" "ENST00000434970.2"
[5] "ENST00000415118.1" "ENST00000633010.1"

In order to extract a transcript’s sequence, we use the transcript name and we need to know the length of the transcript. We pick a transcript at random.

si <- seqinfo(txome)
idx <- 5000
txp_name <- seqnames(si)[idx]
txp_name

[1] "ENST00000560761.5"

txp_len <- seqlengths(si)[idx]
txp_len

ENST00000560761.5 
              657

txp_seq <- getSeq(txome, GRanges(seqnames(si)[idx], IRanges(1, txp_len)))
txp_seq

DNAStringSet object of length 1:
    width seq
[1]   657 AGATATCCTTAGCTAGTGTAACAATGTTGAAAGTGATGTGAGAA...GGGCAAGTCCACCATCACGATCGTGGCCGAGGAGGTATCAGGC

The above method is fairly easy for getting a transcript sequence if we know the cDNA file exists for the same genome and gene version. Another way, if we only have a GRangesList for the transcripts (not genes) is to use the extractTranscriptSeqs function from the GenomicFeatures package. This function takes two arguments: a BSgenome object and the GRangesList of the exons-by-transcript.

suppressPackageStartupMessages(library(EnsDb.Hsapiens.v86))
edb <- EnsDb.Hsapiens.v86

The following line of code can take some time, so I save the file for future use.

library(here)

here() starts at /Users/milove/teach/compbio/compbio_src

file_path <- here("bioc","ebt_cbt.rda")
if (file.exists(file_path)) {
  load(file_path)
} else {
  ebt <- exonsBy(edb, by="tx") # exons-by-transcript
  cbt <- cdsBy(edb, by="tx") # coding-seq-by-transcript
  save(ebt, cbt, file=file_path)
}

txp_name_no_version <- sub("\\..*","",txp_name) # cut version number
txp <- ebt[txp_name_no_version]
library(GenomicFeatures)
# another option here would be ensembldb:::getGenomeTwoBitFile()
txp_seq_extracted <- extractTranscriptSeqs(Hsapiens, txp)
txp_seq_extracted

DNAStringSet object of length 1:
    width seq                                                                   names               
[1]   657 AGATATCCTTAGCTAGTGTAACAATGTTGAAAG...ACCATCACGATCGTGGCCGAGGAGGTATCAGGC ENST00000560761

txp_seq == txp_seq_extracted

[1] TRUE

Note that we can evaluate the peptide sequence using the translate function, however we have to be sure to start with the coding sequence and not include the UTRs (untranslated regions).

coding_loc <- cbt[txp_name_no_version]
coding_seq_extracted <- extractTranscriptSeqs(Hsapiens, coding_loc)
Biostrings::translate(coding_seq_extracted)

AAStringSet object of length 1:
    width seq                                                                   names               
[1]   149 MSDPEMGWVPEPPTMTLGASRVELRVSCHGLLD...IVSQTKVTKPLLLKNGKTAGKSTITIVAEEVSG ENST00000560761

How could you evalute the consequence of a SNP in a coding sequence?

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] here_1.0.1                             EnsDb.Hsapiens.v86_2.99.0             
 [3] ensembldb_2.28.0                       AnnotationFilter_1.28.0               
 [5] GenomicFeatures_1.56.0                 AnnotationDbi_1.66.0                  
 [7] Biobase_2.64.0                         BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
 [9] BSgenome_1.72.0                        rtracklayer_1.64.0                    
[11] BiocIO_1.14.0                          GenomicRanges_1.56.1                  
[13] AnnotationHub_3.12.0                   BiocFileCache_2.12.0                  
[15] dbplyr_2.5.0                           Biostrings_2.72.1                     
[17] GenomeInfoDb_1.40.1                    XVector_0.44.0                        
[19] IRanges_2.38.1                         S4Vectors_0.42.1                      
[21] BiocGenerics_0.50.0                    testthat_3.2.1.1                      
[23] rmarkdown_2.27                         devtools_2.4.5                        
[25] usethis_3.0.0                         

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                   bitops_1.0-8                remotes_2.5.0              
 [4] rlang_1.1.4                 magrittr_2.0.3              matrixStats_1.3.0          
 [7] compiler_4.4.1              RSQLite_2.3.7               png_0.1-8                  
[10] vctrs_0.6.5                 ProtGenerics_1.36.0         stringr_1.5.1              
[13] profvis_0.3.8               pkgconfig_2.0.3             crayon_1.5.3               
[16] fastmap_1.2.0               ellipsis_0.3.2              utf8_1.2.4                 
[19] Rsamtools_2.20.0            promises_1.3.0              sessioninfo_1.2.2          
[22] UCSC.utils_1.0.0            purrr_1.0.2                 bit_4.0.5                  
[25] xfun_0.46                   zlibbioc_1.50.0             cachem_1.1.0               
[28] jsonlite_1.8.8              blob_1.2.4                  later_1.3.2                
[31] DelayedArray_0.30.1         BiocParallel_1.38.0         parallel_4.4.1             
[34] R6_2.5.1                    stringi_1.8.4               pkgload_1.4.0              
[37] brio_1.1.5                  Rcpp_1.0.13                 SummarizedExperiment_1.34.0
[40] knitr_1.48                  httpuv_1.6.15               Matrix_1.7-0               
[43] tidyselect_1.2.1            rstudioapi_0.16.0           abind_1.4-5                
[46] yaml_2.3.10                 codetools_0.2-20            miniUI_0.1.1.1             
[49] curl_5.2.1                  pkgbuild_1.4.4              lattice_0.22-6             
[52] tibble_3.2.1                shiny_1.9.1                 withr_3.0.1                
[55] KEGGREST_1.44.1             evaluate_0.24.0             urlchecker_1.0.1           
[58] pillar_1.9.0                BiocManager_1.30.23         filelock_1.0.3             
[61] MatrixGenerics_1.16.0       generics_0.1.3              rprojroot_2.0.4            
[64] RCurl_1.98-1.16             BiocVersion_3.19.1          xtable_1.8-4               
[67] glue_1.7.0                  lazyeval_0.2.2              tools_4.4.1                
[70] GenomicAlignments_1.40.0    fs_1.6.4                    XML_3.99-0.17              
[73] grid_4.4.1                  GenomeInfoDbData_1.2.12     restfulr_0.0.15            
[76] cli_3.6.3                   rappdirs_0.3.3              fansi_1.0.6                
[79] S4Arrays_1.4.1              dplyr_1.1.4                 digest_0.6.36              
[82] SparseArray_1.4.8           rjson_0.2.21                htmlwidgets_1.6.4          
[85] memoise_2.0.1               htmltools_0.5.8.1           lifecycle_1.0.4            
[88] httr_1.4.7                  mime_0.12                   bit64_4.0.5