Question

Identify small strings in a larger sequence in R

4.0 years ago

Peter ▴ 20

Hi,

I have a vector containing small strings of interest:

seq_vector <- c("NET | NST | NVT | NIT | NCT | NYT | NHT | NRT | NNT | NDT | NTT ")

And I would like to find these small strings in larger strings, which are in my .txt file:

A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR
A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK
A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK

This file has ~3600 entries.

I am able to perform this procedure on a .fasta sequence, using the seqinr, tidyverse and Biostrings packages, but I am having trouble doing the same with the data above.

Does anyone have any ideas and could help me? I'm only interested in sequences that contain a match.

I would like to get something like:

A0A0D9S786.....STDQNHSTETPNLAAAVPSSVSVPR...... NHS
A0A0D9R8B0.....STEVQGMKVNGTKTDNNEGPK ............ NGT

Thank you in advance!

R • 1.1k views

ADD COMMENT • link updated 4.0 years ago by rpolicastro 13k • written 4.0 years ago by Peter ▴ 20

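Read in your data using scan(), then write the matching lines with cat():

# "your_file.txt" is a placeholder; seq_vector must be a regex without padding spaces
lines <- scan("your_file.txt", what = "", sep = "\n")  # one element per line
cat(grep(seq_vector, lines, value = TRUE), sep = "\n", file = "output.txt", append = TRUE)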

ADD REPLY • link 4.0 years ago by english.server ▴ 300

Have you looked at grep() or the stringr package?

ADD REPLY • link 4.0 years ago by Jean-Karim Heriche 27k
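For the grep() route, here is a minimal base R sketch. It assumes the motifs are stored one per element and joined with "|". In the question's string the spaces around each "|" are part of the pattern, so " NST " would only match with a space on both sides. The names motifs, pattern, entries and hits are illustrative, and NHS and NGT are added only because they appear in the desired output; none of the original eleven motifs occurs in the three example lines.

```r
# Motifs from the question, one per element. NHS and NGT are an assumption,
# taken from the asker's desired output rather than the original list.
motifs <- c("NET", "NST", "NVT", "NIT", "NCT", "NYT", "NHT", "NRT",
            "NNT", "NDT", "NTT", "NHS", "NGT")

# Join into a single alternation pattern with no padding spaces.
pattern <- paste(motifs, collapse = "|")

# Example lines copied from the question; in practice they would come from
# readLines() on the .txt file.
entries <- c(
  "A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR",
  "A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK",
  "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
)

# Keep only the entries that contain at least one motif.
hits <- grep(pattern, entries, value = TRUE)
print(hits)
```

grep() only tells you which lines match. To recover the matched motifs themselves, regmatches() with gregexpr(), or stringr::str_extract_all(), would be needed.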

score 3 · Answer 1 · 2020-12-03

Example data

df <- structure(list(seq = c("A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR", 
"A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK", "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
)), row.names = c(NA, -3L), class = "data.frame")

> df
                                              seq
1   A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR
2        A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK
3 A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK

A tidyverse solution.

library("tidyverse")

# I removed the spaces from your regex and added a few more motifs (NHS, LAA, QVA), since your original motifs had no matches in the example data.
seq_vector <- "NET|NST|NVT|NIT|NCT|NYT|NHT|NRT|NNT|NDT|NTT|NHS|LAA|QVA"

matches <- df %>%
  mutate(
    n_matches=str_count(seq, seq_vector),
    matches=str_extract_all(seq, seq_vector)
  ) %>%
  filter(n_matches > 0) %>%
  unnest_wider(matches)

> matches
# A tibble: 2 x 4
  seq                                             n_matches ...1  ...2 
  <chr>                                               <int> <chr> <chr>
1 A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR           2 NHS   LAA  
2 A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK         1 QVA   NA
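If a layout closer to the one requested in the question is preferred (accession, peptide, matched motifs), the same idea can be sketched in base R without extra packages. This is an assumption-laden sketch rather than part of the answer above: it assumes each line is an accession followed by a run of dots and/or spaces and then the peptide, and the column names id, peptide, n_matches and matches are illustrative. Matching is done on the peptide only, so the accession cannot contribute a hit.

```r
# Example lines copied from the question; readLines() on the .txt file would
# give the same kind of character vector.
entries <- c(
  "A0A0D9S786..........STDQNHSTETPNLAAAVPSSVSVPR",
  "A0A0D9R8B0........ STEVQGMKVNGTKTDNNEGPK",
  "A0A0D9RJY3........ STHNLQVAALDANGTVVEGPVPITIEVK"
)

# Same pattern as in the tidyverse solution.
pattern <- "NET|NST|NVT|NIT|NCT|NYT|NHT|NRT|NNT|NDT|NTT|NHS|LAA|QVA"

# Split each line at the run of dots/spaces (assumed separator).
id      <- sub("[. ]+.*$", "", entries)        # text before the separator
peptide <- sub("^[^. ]+[. ]+", "", entries)    # text after the separator

# All non-overlapping matches per peptide, as a list of character vectors.
found <- regmatches(peptide, gregexpr(pattern, peptide))

result <- data.frame(
  id        = id,
  peptide   = peptide,
  n_matches = lengths(found),
  matches   = vapply(found, paste, character(1), collapse = ","),
  stringsAsFactors = FALSE
)

# Keep only the peptides with at least one match.
result <- result[result$n_matches > 0, ]
print(result)
```

Collapsing the matches into one comma-separated column keeps one row per entry regardless of how many motifs are found; write.table(result, "output.txt", sep = "\t", quote = FALSE, row.names = FALSE) would then save it as a text file.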