Grep specific sequence
4
0
Entering edit mode
3.6 years ago

Hi there, Is there any tools to find specific sequence (not as part of a longer sequence) in a fastq file?

So for example if I am looking for AACTACGTCCATCG and only wants "hits" with exact matches AACTACGTCCATCG even thought its within a longer sequence i.e GCACTACTGGTCA(AACTACGTCCATC)? I know that using the function grep will give you both AACTACGTCCATC and GCACTACTGGTCA(AACTACGTCCATC).

Hope it makes sense.

Cheers

fastq smallRNA_Seq smallRNA GREP • 3.9k views
ADD COMMENT
1
Entering edit mode

"not as part of a longer sequence" is exactly the opposite of "even though its withing a longer sequence". Maybe you should clarify that, although it looks like a simple grep would do.

ADD REPLY
1
Entering edit mode
3.6 years ago
Michael 55k

If it is a "well-behaved" (no blank lines, no multi-line seqs, no such sequence in the quality string) fastq file the following also works very nicely to give you all matching fastq records, so no need to install any toolkit:

grep -B1 -A2 -e "^AACTACGTCCATC$" test.fq
ADD COMMENT
1
Entering edit mode

And if it is a "very-well-behaved" one (no such sequence in any header, then a fixed word search would be even slightly faster ;)

grep -B1 -A2 -F -w AACTACGTCCATC test.fq

Edit: just tested on a very large fastq file and the -w option is not at all faster than a proper start-end search. grep -B1 -A2 "^AACTACGTCCATC$" test.fq is definitely faster.

ADD REPLY
0
Entering edit mode

Thanks a lot!!

ADD REPLY
0
Entering edit mode

marie.lorans : Based on clarification below this is the result you wanted (though others are useful in other circumstances). So you should accept Michael Dondrup answer (green checkmark) to provide closure to this thread.

ADD REPLY
0
Entering edit mode

I think in this context "very-well-behaved" == "well-behaved". I thought about the headers as well, but fastq headers start with @, so that should be ok, but it could theoretically appear in the quality string.

ADD REPLY
0
Entering edit mode

Why the -e?

ADD REPLY
0
Entering edit mode

Looks more like a habit, maybe to search for multiple patterns, rather than a requirement in this case.

ADD REPLY
0
Entering edit mode

Indeed, I always write it like this, think some linux guru taught me that this was better.

ADD REPLY
1
Entering edit mode
3.6 years ago
Michael 55k

Seqkit grep should do that: See https://bioinf.shenwei.me/seqkit/usage/#grep The command might look like this (tested, works ok, insert your sequence of choice between ^ and $):

seqkit grep -s -r -i -P -p "^aggcg$"  file.fq
ADD COMMENT
1
Entering edit mode
3.6 years ago
GenoMax 147k

bbduk.sh from BBMap suite in filter mode would work as well. A GUIDE is available.

$ bbduk.sh -Xmx2g in=file1.fq literal=your_seq_of_interest outm=out.fq

Use in1= in2= outm1= outm2= if you have paired-end data.

ADD COMMENT
0
Entering edit mode
3.6 years ago
sg927357 ▴ 20

Use grep -o flag it will give you the exact sequence. Usage: grep -o AACTACGTCCATCG filename

ADD COMMENT
1
Entering edit mode

This is wrong. -o does not limit matching, it is a display related flag.

grep -o AACTACGTCCATCG filename

will print

AACTACGTCCATCG

as many times as the file has that string in it, it won't match in an "exact" manner as grep -w would.

ADD REPLY
0
Entering edit mode

Yeah I know, but accordingly to the question asked, "its within a longer sequence i.e GCACTACTGGTCA(AACTACGTCCATC)", if the string matches within a sequence, it will display only the match portion.

ADD REPLY
0
Entering edit mode

OP's wording is quite ambiguous - it's not clear if they only want exact matches or if they want all matches with exact overlaps highlighted. The latter doesn't make much sense, but I guess if that were their question, your answer would be the best fit. Let's see if they clarify on their requirement.

ADD REPLY
0
Entering edit mode

Thanks a lot. We want exact matches i.e AACTACGTCCATC but not if it's a part of a longer sequence

ADD REPLY

Login before adding your answer.

Traffic: 1293 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6