Question

visualising letterFrequencyInSlidingView (Biostrings, Bioconductor) output

0

Entering edit mode

2.8 years ago

j.jacob1 • 0

As a new user of Bioconductor, I'm trying to visualise the GC content in a sliding window of 5000 nucleotides, in a region between 10 million and 11 million nucleotides on chromosome 2L of the Drosophila genome. In the visualisation I want to show the mean GC content (e.g. using abline), and highlight the regions that have a GC content > 0.5 and < 0.35. I started by typing:

chr2L_range10_11m <- GRanges("chr2L", IRanges(start = 1e7, end = 11e6))

then got the DNA sequence in that range like so:

chr2L_dna_10_11m <- getSeq(Dmelanogaster, chr2L_range10_11m)

The chr2L_dna_10_11m object looks like it has worked:

DNAStringSet object of length 1: width seq [1] 1000001 TGACAGGATTCACTGGCTCGGGACTATGGGCCAGCACTA...AGCATCGGACAAAATATTGTACAAGGTCTCAGTTCGCA

So far, so good. The problem comes with the letterFrequencyInSlidingView command

letterFrequencyInSlidingView(chr2L_dna_10_11m, letters = "GC", as.prob = TRUE)

This throws an error message:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘letterFrequencyInSlidingView’ for signature ‘"DNAStringSet"’

How do I resolve this, and get the ggplot?

SessionInfo() R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /gpfs3/apps/eb/2020b/skylake/software/FlexiBLAS/3.0.4-GCC-10.3.0/lib64/libflexiblas.so.3.0

locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base

other attached packages: [1] forcats_0.5.1 stringr_1.4.0
[3] dplyr_1.0.6 purrr_0.3.4
[5] readr_1.4.0 tidyr_1.1.3
[7] tibble_3.1.2 ggplot2_3.3.3
[9] tidyverse_1.3.1 BSgenome.Dmelanogaster.UCSC.dm2_1.4.0 [11] BSgenome_1.60.0 rtracklayer_1.52.0
[13] GenomicRanges_1.44.0 Biostrings_2.60.0
[15] GenomeInfoDb_1.28.0 XVector_0.32.0
[17] IRanges_2.26.0 S4Vectors_0.30.0
[19] BiocGenerics_0.38.0

loaded via a namespace (and not attached): [1] MatrixGenerics_1.4.0 Biobase_2.52.0 httr_1.4.2
[4] jsonlite_1.7.2 modelr_0.1.8 assertthat_0.2.1
[7] GenomeInfoDbData_1.2.6 cellranger_1.1.0 Rsamtools_2.8.0
[10] yaml_2.2.1 pillar_1.6.1 backports_1.2.1
[13] lattice_0.20-44 glue_1.4.2 digest_0.6.27
[16] rvest_1.0.0 colorspace_2.0-1 htmltools_0.5.1.1
[19] Matrix_1.3-4 XML_3.99-0.6 pkgconfig_2.0.3
[22] broom_0.7.6 haven_2.4.1 zlibbioc_1.38.0
[25] scales_1.1.1 BiocParallel_1.26.0 generics_0.1.0
[28] ellipsis_0.3.2 withr_2.4.2 SummarizedExperiment_1.22.0 [31] cli_2.5.0 magrittr_2.0.1 crayon_1.4.1
[34] readxl_1.3.1 evaluate_0.14 fs_1.5.0
[37] fansi_0.5.0 xml2_1.3.2 tools_4.1.0
[40] hms_1.1.0 BiocIO_1.2.0 lifecycle_1.0.0
[43] matrixStats_0.59.0 munsell_0.5.0 reprex_2.0.0
[46] DelayedArray_0.18.0 compiler_4.1.0 rlang_0.4.11
[49] grid_4.1.0 RCurl_1.98-1.3 rstudioapi_0.13
[52] rjson_0.2.20 bitops_1.0-7 rmarkdown_2.8
[55] restfulr_0.0.13 gtable_0.3.0 DBI_1.1.1
[58] R6_2.5.0 GenomicAlignments_1.28.0 lubridate_1.7.10
[61] knitr_1.33 utf8_1.2.1 stringi_1.6.2
[64] Rcpp_1.0.7 vctrs_0.3.8 dbplyr_2.1.1
[67] tidyselect_1.1.1 xfun_0.23

Biostrings letterFrequencyInSlidingView Bioconductor ggplot • 1.4k views

ADD COMMENT • link updated 2.8 years ago by cpad0112 21k • written 2.8 years ago by j.jacob1 • 0

0

Entering edit mode

Try:

library(Biostrings)
seq=DNAString("TGACAGGATTCACTGGCTCGGGACTATGGGCCAGCACTAAGCATCGGACAAAATATTGTACAAGGTCTCAGTTCGCA")
letterFrequencyInSlidingView(seq,letters = "GC",  view.width = 2)

You also need to provide the window length for sliding window frequency calculation.

ADD REPLY • link 2.8 years ago by cpad0112 21k

0

Entering edit mode

Thanks, your code works fine, and the view.width argument is a key one that I omitted. There is still a problem however, as my sequence data was obtained differently than in your example (see above). With the view.width argument added in:

letterFrequencyInSlidingView(chr2L_dna_10_11m, view.width = 5000, 
                             letters = "GC", as.prob = TRUE)

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘letterFrequencyInSlidingView’ for signature ‘"DNAStringSet"’

ADD REPLY • link 2.8 years ago by j.jacob1 • 0

0

Entering edit mode

method is applicable to only string, not stringset. Check if you can use loop or apply over stringset. Some thing like this:

for (i in seq (1,length(ds))){
    print (letterFrequencyInSlidingView(DNAString(unlist(ds[i])), 2, "GC", OR="|"))
    }

ds= dnastringset

ADD REPLY • link 2.8 years ago by cpad0112 21k

0

Entering edit mode

OK, at least I get an output that's not an error message although I don't understand it. Curiously, the object chr2L_dna_10_11m is classed as a DNAStringSET not a DNAString although this object is a single stretch of DNA sequence as far as the software is concerned. Along the same lines, I'm curious as to the need for the unlist argument in your code, given that I'm dealing with a contiguous stretch of DNA. I surmise that the GRanges command (and/or the getSeq command??) generates a DNAStringSet object. The other puzzle to me is in your code

 print (letterFrequencyInSlidingView(DNAString(unlist(ds[i])), 2, "GC", OR="|"))

why the number 2? I think that refers to whether there is a G or a C at that position? Doesn't look like it's a fixed.width argument though. Substituting in my objects in place of ds:

for (i in seq (1,length(chr2L_dna_10_11m))){
    print(letterFrequencyInSlidingView(DNAString(unlist(chr2L_dna_10_11m[i])), 2, "GC", OR="|"))
    }

...produces an output of around a million rows that I don't know how to interpret - snapshot of the first 13 rows below: enter image description here

How do I factor in a sliding window (fixed.width) of 5000?

ADD REPLY • link 2.8 years ago by j.jacob1 • 0

0

Entering edit mode

Curiously, the object chr2L_dna_10_11m is classed as a DNAStringSET not a DNAString although this object is a single stretch of DNA sequence as far as the software is concerned

getseq function is creating a stringset object instead of string object.

I'm curious as to the need for the unlist argument in your code

In my earlier example, I worked with DNAstring object and you had DNAstringset object. So I had to create a DNAstringset object. I wanted to create DNAstring object from DNAstringset object. Hence I had to subset (a sequence) as DNAstringset object and convert it is DNAstring object. One can access DNAstring object from DNAstringset object using [[]].

 print (letterFrequencyInSlidingView(DNAString(unlist(ds[i])), 2, "GC", OR="|"))

2 can be any number (width of the window). It's not a chosen number, it's random window size i choose to count G or C in that window.

For your problem, I would suggest this (assuming that there is only one sequence in DNAstringset):

$ chr2L_dna_10_11m <- getSeq(Dmelanogaster, chr2L_range10_11m)
$ letterFrequencyInSlidingView(chr2L_dna_10_11m[[1]], letters = "GC", as.prob = TRUE, view.width=5000, OR="|")

Last line should print G or C in a sliding window of 5000bp in your sequence and step seems to be 1, by default. If you have multiple sequences as DNAstringset, you can use following function:

$ lapply(1:length(ds), function(x) letterFrequencyInSlidingView(chr2L_dna_10_11m[[x]], letters = "GC", as.prob = TRUE, view.width=2, OR="|"))

ADD REPLY • link 2.8 years ago by cpad0112 21k