Pattren counter in fasta file
3
0
Entering edit mode
7.5 years ago
adityagt15 • 0

I am trying to get the count of the matching pattren in the fasta file. i am having the fasta files which contains 57k sequences. i want to pull out the count of my matching pattern sequence. And required the starting position of the pattren Input file:

chr1 ATTAGCAGATGTGACGTCGATGTCAGATTG

chr2 TGAGCTGCAGATCGTAGATGATTCTGCAGGAACCT

chr3 TCTTTCAGATGCCTCTGCAGATTC

Searching pattern "CAGAT"

Output Required:

chr Count P1 P2

chr1 -- 2 -- 6 -- 25

chr2 -- 1 -- 8

chr3 -- 2 -- 6 -- 19

Thanks in advance

counter perl awk Regular Expression fasta • 1.9k views
ADD COMMENT
3
Entering edit mode
7.5 years ago

You can also try SeqKit.

$ seqkit locate -p CAGAT seqs.fa | column -t
seqID  patternName  pattern  strand  start  end  matched
chr1   CAGAT        CAGAT    +       6      10   CAGAT
chr1   CAGAT        CAGAT    +       24     28   CAGAT
chr2   CAGAT        CAGAT    +       8      12   CAGAT
chr3   CAGAT        CAGAT    +       6      10   CAGAT
chr3   CAGAT        CAGAT    +       18     22   CAGAT

$ seqkit locate -p CAGAT seqs.fa --bed
chr1    5       10      CAGAT   0       +
chr1    23      28      CAGAT   0       +
chr2    7       12      CAGAT   0       +
chr3    5       10      CAGAT   0       +
chr3    17      22      CAGAT   0       +

$ seqkit locate -p CAGAT seqs.fa --gtf
chr1    SeqKit  location        6       10      0       +       .       gene_id "CAGAT"; 
chr1    SeqKit  location        24      28      0       +       .       gene_id "CAGAT"; 
chr2    SeqKit  location        8       12      0       +       .       gene_id "CAGAT"; 
chr3    SeqKit  location        6       10      0       +       .       gene_id "CAGAT"; 
chr3    SeqKit  location        18      22      0       +       .       gene_id "CAGAT";

Count of every sequences with help of csvtk:

$ seqkit locate -p CAGAT seqs.fa | csvtk freq -t -f seqID -k
seqID   frequency
chr1    2
chr2    1
chr3    2

# multiple search patterns
$ seqkit locate -p CAGAT -p TTC seqs.fa | csvtk freq -t -f seqID,patternName -k | column -t
seqID  patternName  frequency
chr1   CAGAT        2
chr2   CAGAT        1
chr2   TTC          2
chr3   CAGAT        2
chr3   TTC          2
ADD COMMENT
1
Entering edit mode
7.5 years ago

If interested, I wrote a program for this kind of searches, it's here fastaRegexFinder

ADD COMMENT
0
Entering edit mode

Thanks for the response, The link is not opening sir...

ADD REPLY
0
Entering edit mode

Ooops, sorry, try now

ADD REPLY
0
Entering edit mode

is this the way i have to give input

python code.py -f singleline.fasta -m 'CTGCAG'

ADD REPLY
0
Entering edit mode

It should be

python /path/to/fastaRegexFinder.py -f seq.fasta -r 'CAGAT'

It should be fairly well documented how to use it.

ADD REPLY
0
Entering edit mode

Thanks for your response , its working..

ADD REPLY
1
Entering edit mode
7.5 years ago

for R coders:

my fasta file:

>chr1
ATTAGCAGATGTGACGTCGATGTCAGATTG
>chr2
TGAGCTGCAGATCGTAGATGATTCTGCAGGAACCT
>chr3
TCTTTCAGATGCCTCTGCAGATTC

R code

library(Biostrings)
data=readDNAStringSet("~/Desktop/test.fa", format = "fasta")
query=DNAString("CAGAT")
result=as.data.frame(vmatchPattern(query,data))

result

> result
  group group_name start end width
1     1       <NA>     6  10     5
2     1       <NA>    24  28     5
3     2       <NA>     8  12     5
4     3       <NA>     6  10     5
5     3       <NA>    18  22     5
>
ADD COMMENT
0
Entering edit mode

How do we figure out what patterns are represented in that table? You are using CAGAT as the original question but it is not obvious in the results table in present form.

ADD REPLY
1
Entering edit mode
library(Biostrings)
data=readDNAStringSet("~/Desktop/test.fa", format = "fasta")
query=DNAString("CAGAT")
result=cbind(pattern=as.character(query),as.data.frame(vmatchPattern(query,data)))

> result
  pattern group group_name start end width
1   CAGAT     1       <NA>     6  10     5
2   CAGAT     1       <NA>    24  28     5
3   CAGAT     2       <NA>     8  12     5
4   CAGAT     3       <NA>     6  10     5
5   CAGAT     3       <NA>    18  22     5

*Still doesn't work multiple patterns.

ADD REPLY
0
Entering edit mode

Would be better to update the code in the original answer by editing that since that is what people will try to use. This looks better. Can you fix the group names so the actual sequences the pattern is coming from are identified?

ADD REPLY

Login before adding your answer.

Traffic: 1260 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6