Question

Pattern match using R Biostrings

2.2 years ago

asumani ▴ 70

Hi,

Aim: I am trying to get the positions and types of all stop codons in a DNAString object or a character string.

stops <- c("TAG","TAA","TGA")
vmatchPattern(stops, stringObj)  # errors: the pattern must be a single string

I also tried defining stops as "TAA|TAG|TGA", but I know that neither form is supported by the vmatchPattern function. Then I tried:

stop1 <- matchPattern("TAG", as(trx, "character")) %>%
  as.data.frame()
stop2 <- matchPattern("TAA", as(trx, "character")) %>%
  as.data.frame()
stop3 <- matchPattern("TGA", as(trx, "character")) %>%
  as.data.frame()
stops <- rbind(stop1, stop2, stop3)

The outcome below is satisfactory, but I wish I could find a cleverer solution.

   start  end width seq
1    178  180     3 TAG
2    400  402     3 TAG
3    427  429     3 TAG
4    574  576     3 TAG
5    344  346     3 TAA
6    443  445     3 TAA
7    692  694     3 TAA
8     48   50     3 TGA
9     88   90     3 TGA
10   437  439     3 TGA
11   455  457     3 TGA
12   496  498     3 TGA
13   509  511     3 TGA
14   538  540     3 TGA
15   649  651     3 TGA
16   746  748     3 TGA

Can we find another solution to this problem of mine?
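
For reference, the three near-identical calls above can be collapsed into one loop over the codons. This is only a minimal sketch, not a tested answer to the question: `tx` is a made-up toy sequence standing in for a single transcript (with the object from this post it would presumably be something like `trx[[1]]`), and `stop_hits` is a hypothetical name. It uses only `matchPattern()` plus the `start()`, `end()` and `width()` accessors from Biostrings.

```r
library(Biostrings)

stops <- c("TAG", "TAA", "TGA")

# Toy sequence (an assumption for illustration only); replace with one
# transcript as a DNAString, e.g. trx[[1]].
tx <- DNAString("ATGAAATAGCCCTAATGA")

# Run matchPattern() once per stop codon and collect the hits,
# labelling each row with the codon that produced it.
hits <- lapply(stops, function(codon) {
  m <- matchPattern(codon, tx)
  data.frame(start = start(m),
             end   = end(m),
             width = width(m),
             seq   = rep(codon, length(m)))
})

# Combine the per-codon tables and sort by position along the transcript.
stop_hits <- do.call(rbind, hits)
stop_hits <- stop_hits[order(stop_hits$start), ]
rownames(stop_hits) <- NULL
stop_hits
```

Sorting by `start` gives the stop codons in the order they occur, which the `rbind()` version above does not. Note that this reports matches in all three reading frames, exactly as the original three calls do.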

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_IE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_IE.UTF-8    
 [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_IE.UTF-8   
 [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.64.1   GenomeInfoDb_1.32.4 XVector_0.36.0      IRanges_2.30.1     
 [5] S4Vectors_0.34.0    BiocGenerics_0.42.0 gridExtra_2.3       forcats_0.5.2      
 [9] stringr_1.4.1       dplyr_1.0.10        purrr_0.3.4         readr_2.1.3        
[13] tidyr_1.2.1         tibble_3.1.8        ggplot2_3.3.6       tidyverse_1.3.2    

loaded via a namespace (and not attached):
 [1] lubridate_1.8.0        assertthat_0.2.1       digest_0.6.29         
 [4] utf8_1.2.2             R6_2.5.1               cellranger_1.1.0      
 [7] backports_1.4.1        reprex_2.0.2           evaluate_0.16         
[10] httr_1.4.4             pillar_1.8.1           zlibbioc_1.42.0       
[13] rlang_1.0.6            googlesheets4_1.0.1    readxl_1.4.1          
[16] rstudioapi_0.14        rmarkdown_2.16         labeling_0.4.2        
[19] googledrive_2.0.0      bit_4.0.4              RCurl_1.98-1.8        
[22] munsell_0.5.0          broom_1.0.1            compiler_4.2.1        
[25] modelr_0.1.9           xfun_0.33              pkgconfig_2.0.3       
[28] htmltools_0.5.3        tidyselect_1.1.2       GenomeInfoDbData_1.2.8
[31] fansi_1.0.3            crayon_1.5.2           tzdb_0.3.0            
[34] dbplyr_2.2.1           withr_2.5.0            bitops_1.0-7          
[37] grid_4.2.1             jsonlite_1.8.2         gtable_0.3.1          
[40] lifecycle_1.0.2        DBI_1.1.3              magrittr_2.0.3        
[43] scales_1.2.1           vroom_1.6.0            cli_3.4.1             
[46] stringi_1.7.8          farver_2.1.1           fs_1.5.2              
[49] xml2_1.3.3             ellipsis_0.3.2         generics_0.1.3        
[52] vctrs_0.4.2            tools_4.2.1            bit64_4.0.5           
[55] glue_1.6.2             hms_1.1.2              parallel_4.2.1        
[58] fastmap_1.1.0          yaml_2.3.5             colorspace_2.0-3      
[61] gargle_1.2.1           rvest_1.0.3            knitr_1.40            
[64] haven_2.5.1

biostrings pattern • 1.3k views

updated 2.2 years ago by Kevin Blighe 88k • written 2.2 years ago by asumani ▴ 70

Could you give an example of your original dataset? For instance, the output of dput(stringObj).

2.2 years ago by Basti ★ 2.0k

Here is an example. I want to do the search on individual transcripts, not the entire object.

>dput(trx)
new("DNAStringSet", pool = new("SharedRaw_Pool", xp_list = list(
    <pointer: (nil)>), .link_to_cached_object_list = list(<environment>)), 
    ranges = new("GroupedIRanges", group = 1L, start = 108951763L, 
        width = 2252L, NAMES = "GABPXX", 
        elementType = "ANY", elementMetadata = NULL, metadata = list()), 
    elementType = "DNAString", elementMetadata = NULL, metadata = list())

2.2 years ago by asumani ▴ 70

Cross-posted: https://support.bioconductor.org/p/9146842/

2.2 years ago by Kevin Blighe 88k

Is there something wrong with that? Since it is a Bioconductor-specific question, I thought I could reach a wider range of people. I didn't intend to spam.

2.2 years ago by asumani ▴ 70

Generally, it's good etiquette to post in only one place at a time, since cross-posting uses the time of multiple scientists on a single question.

2.2 years ago by rpolicastro 13k

Thank you! I will be careful next time.

2.2 years ago by asumani ▴ 70

Thank you, my friend (Gracias amigo / Go raibh maith agat mo chara)

2.2 years ago by Kevin Blighe 88k