As a new user of Bioconductor, I'm trying to visualise the GC content in a sliding window of 5000 nucleotides, in a region between 10 million and 11 million nucleotides on chromosome 2L of the Drosophila genome. In the visualisation I want to show the mean GC content (e.g. using abline), and highlight the regions that have a GC content > 0.5 and < 0.35. I started by typing:
chr2L_range10_11m <- GRanges("chr2L", IRanges(start = 1e7, end = 11e6))
then got the DNA sequence in that range like so:
chr2L_dna_10_11m <- getSeq(Dmelanogaster, chr2L_range10_11m)
The chr2L_dna_10_11m
object looks like it has worked:
DNAStringSet object of length 1: width seq [1] 1000001 TGACAGGATTCACTGGCTCGGGACTATGGGCCAGCACTA...AGCATCGGACAAAATATTGTACAAGGTCTCAGTTCGCA
So far, so good. The problem comes with the letterFrequencyInSlidingView
command
letterFrequencyInSlidingView(chr2L_dna_10_11m, letters = "GC", as.prob = TRUE)
This throws an error message:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘letterFrequencyInSlidingView’ for signature ‘"DNAStringSet"’
How do I resolve this, and get the ggplot?
SessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default BLAS/LAPACK: /gpfs3/apps/eb/2020b/skylake/software/FlexiBLAS/3.0.4-GCC-10.3.0/lib64/libflexiblas.so.3.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0
[3] dplyr_1.0.6 purrr_0.3.4
[5] readr_1.4.0 tidyr_1.1.3
[7] tibble_3.1.2 ggplot2_3.3.3
[9] tidyverse_1.3.1 BSgenome.Dmelanogaster.UCSC.dm2_1.4.0
[11] BSgenome_1.60.0 rtracklayer_1.52.0
[13] GenomicRanges_1.44.0 Biostrings_2.60.0
[15] GenomeInfoDb_1.28.0 XVector_0.32.0
[17] IRanges_2.26.0 S4Vectors_0.30.0
[19] BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] MatrixGenerics_1.4.0 Biobase_2.52.0 httr_1.4.2
[4] jsonlite_1.7.2 modelr_0.1.8 assertthat_0.2.1
[7] GenomeInfoDbData_1.2.6 cellranger_1.1.0 Rsamtools_2.8.0
[10] yaml_2.2.1 pillar_1.6.1 backports_1.2.1
[13] lattice_0.20-44 glue_1.4.2 digest_0.6.27
[16] rvest_1.0.0 colorspace_2.0-1 htmltools_0.5.1.1
[19] Matrix_1.3-4 XML_3.99-0.6 pkgconfig_2.0.3
[22] broom_0.7.6 haven_2.4.1 zlibbioc_1.38.0
[25] scales_1.1.1 BiocParallel_1.26.0 generics_0.1.0
[28] ellipsis_0.3.2 withr_2.4.2 SummarizedExperiment_1.22.0
[31] cli_2.5.0 magrittr_2.0.1 crayon_1.4.1
[34] readxl_1.3.1 evaluate_0.14 fs_1.5.0
[37] fansi_0.5.0 xml2_1.3.2 tools_4.1.0
[40] hms_1.1.0 BiocIO_1.2.0 lifecycle_1.0.0
[43] matrixStats_0.59.0 munsell_0.5.0 reprex_2.0.0
[46] DelayedArray_0.18.0 compiler_4.1.0 rlang_0.4.11
[49] grid_4.1.0 RCurl_1.98-1.3 rstudioapi_0.13
[52] rjson_0.2.20 bitops_1.0-7 rmarkdown_2.8
[55] restfulr_0.0.13 gtable_0.3.0 DBI_1.1.1
[58] R6_2.5.0 GenomicAlignments_1.28.0 lubridate_1.7.10
[61] knitr_1.33 utf8_1.2.1 stringi_1.6.2
[64] Rcpp_1.0.7 vctrs_0.3.8 dbplyr_2.1.1
[67] tidyselect_1.1.1 xfun_0.23
Try:
You also need to provide the window length for sliding window frequency calculation.
Thanks, your code works fine, and the
view.width
argument is a key one that I omitted. There is still a problem however, as my sequence data was obtained differently than in your example (see above). With theview.width
argument added in:method is applicable to only string, not stringset. Check if you can use loop or apply over stringset. Some thing like this:
ds= dnastringset
OK, at least I get an output that's not an error message although I don't understand it. Curiously, the object
chr2L_dna_10_11m
is classed as a DNAStringSET not a DNAString although this object is a single stretch of DNA sequence as far as the software is concerned. Along the same lines, I'm curious as to the need for theunlist
argument in your code, given that I'm dealing with a contiguous stretch of DNA. I surmise that theGRanges
command (and/or thegetSeq
command??) generates a DNAStringSet object. The other puzzle to me is in your codewhy the number 2? I think that refers to whether there is a G or a C at that position? Doesn't look like it's a
fixed.width
argument though. Substituting in my objects in place of ds:...produces an output of around a million rows that I don't know how to interpret - snapshot of the first 13 rows below:
How do I factor in a sliding window (
fixed.width
) of 5000?getseq function is creating a stringset object instead of string object.
In my earlier example, I worked with DNAstring object and you had DNAstringset object. So I had to create a DNAstringset object. I wanted to create DNAstring object from DNAstringset object. Hence I had to subset (a sequence) as DNAstringset object and convert it is DNAstring object. One can access DNAstring object from DNAstringset object using [[]].
2 can be any number (width of the window). It's not a chosen number, it's random window size i choose to count G or C in that window.
For your problem, I would suggest this (assuming that there is only one sequence in DNAstringset):
Last line should print G or C in a sliding window of 5000bp in your sequence and step seems to be 1, by default. If you have multiple sequences as DNAstringset, you can use following function: