2.2 years ago
asumani
Hi,
Aim: I am trying to get the positions of all stop codons, and the type of each stop codon, given a DNAString object or a character string.
stops <- c("TAG","TAA","TGA")
vmatchPattern(stops, stringObj)
I also tried to define stops as "TAA|TAG|TGA",
but I know neither form is supported by the vmatchPattern
function. Then I tried:
library(Biostrings)
library(dplyr)  # provides the %>% pipe
# Search the transcript once per stop codon, then stack the results
stop1 <- matchPattern("TAG", as(trx, "character")) %>%
  as.data.frame()
stop2 <- matchPattern("TAA", as(trx, "character")) %>%
  as.data.frame()
stop3 <- matchPattern("TGA", as(trx, "character")) %>%
  as.data.frame()
stops <- rbind(stop1, stop2, stop3)
The outcome below is satisfactory, but I wish I could find a cleverer solution.
start end width seq
1 178 180 3 TAG
2 400 402 3 TAG
3 427 429 3 TAG
4 574 576 3 TAG
5 344 346 3 TAA
6 443 445 3 TAA
7 692 694 3 TAA
8 48 50 3 TGA
9 88 90 3 TGA
10 437 439 3 TGA
11 455 457 3 TGA
12 496 498 3 TGA
13 509 511 3 TGA
14 538 540 3 TGA
15 649 651 3 TGA
16 746 748 3 TGA
Can we find another solution to this problem of mine?
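One possible alternative, sketched here in base R only: because no stop codon can overlap another (no suffix of TAA, TAG or TGA is a prefix of any of them), a single regular-expression scan with gregexpr() finds every occurrence in one pass. The helper name find_stops is hypothetical (it is not part of Biostrings), and the sketch assumes the input is a single sequence, either a character string or a DNAString (the latter needs Biostrings loaded so that as.character() works on it).

```r
# Hypothetical helper, not a Biostrings function: locate every stop codon
# (all three forward frames) in ONE sequence.
find_stops <- function(seq) {
  seq <- toupper(as.character(seq))          # character string or DNAString
  m <- gregexpr("TAA|TAG|TGA", seq)          # one scan for all three codons
  start <- as.integer(m[[1]])
  if (start[1] == -1L) {                     # gregexpr returns -1 when nothing matches
    return(data.frame(start = integer(0), end = integer(0),
                      width = integer(0), seq = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(start = start,
             end   = start + 2L,
             width = 3L,
             seq   = regmatches(seq, m)[[1]], # the matched codon itself
             stringsAsFactors = FALSE)
}

# Toy sequence (made up for illustration): TAG at 4-6, TGA at 9-11, TAA at 12-14
print(find_stops("ATGTAGCCTGATAA"))
```

The rows come back ordered by position rather than grouped by codon. To search several transcripts individually, something like lapply(as.character(stringObj), find_stops) should give one data frame per transcript, assuming stringObj is a DNAStringSet.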
sessionInfo( )
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_IE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_IE.UTF-8 LC_COLLATE=en_IE.UTF-8
[5] LC_MONETARY=en_IE.UTF-8 LC_MESSAGES=en_IE.UTF-8
[7] LC_PAPER=en_IE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.64.1 GenomeInfoDb_1.32.4 XVector_0.36.0 IRanges_2.30.1
[5] S4Vectors_0.34.0 BiocGenerics_0.42.0 gridExtra_2.3 forcats_0.5.2
[9] stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4 readr_2.1.3
[13] tidyr_1.2.1 tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2
loaded via a namespace (and not attached):
[1] lubridate_1.8.0 assertthat_0.2.1 digest_0.6.29
[4] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0
[7] backports_1.4.1 reprex_2.0.2 evaluate_0.16
[10] httr_1.4.4 pillar_1.8.1 zlibbioc_1.42.0
[13] rlang_1.0.6 googlesheets4_1.0.1 readxl_1.4.1
[16] rstudioapi_0.14 rmarkdown_2.16 labeling_0.4.2
[19] googledrive_2.0.0 bit_4.0.4 RCurl_1.98-1.8
[22] munsell_0.5.0 broom_1.0.1 compiler_4.2.1
[25] modelr_0.1.9 xfun_0.33 pkgconfig_2.0.3
[28] htmltools_0.5.3 tidyselect_1.1.2 GenomeInfoDbData_1.2.8
[31] fansi_1.0.3 crayon_1.5.2 tzdb_0.3.0
[34] dbplyr_2.2.1 withr_2.5.0 bitops_1.0-7
[37] grid_4.2.1 jsonlite_1.8.2 gtable_0.3.1
[40] lifecycle_1.0.2 DBI_1.1.3 magrittr_2.0.3
[43] scales_1.2.1 vroom_1.6.0 cli_3.4.1
[46] stringi_1.7.8 farver_2.1.1 fs_1.5.2
[49] xml2_1.3.3 ellipsis_0.3.2 generics_0.1.3
[52] vctrs_0.4.2 tools_4.2.1 bit64_4.0.5
[55] glue_1.6.2 hms_1.1.2 parallel_4.2.1
[58] fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3
[61] gargle_1.2.1 rvest_1.0.3 knitr_1.40
[64] haven_2.5.1
Could you give an example of your original dataset?
dput(stringObj), for instance.
Here is an example. I want to do the search on individual transcripts, not the entire object.
Cross-posted: https://support.bioconductor.org/p/9146842/
Is there something wrong with this? Since it is a Bioconductor-specific question, I thought I could reach a wider range of people. I didn't intend to spam.
Generally it's good etiquette to post in only one place at a time. If you cross-post, you are using the time of multiple scientists for a single question.
Thank you! I will be careful next time.
Thanks, friend / Thank you, my friend