Hi there,
Are there any tools to find a specific sequence (not as part of a longer sequence) in a fastq file?
So, for example, if I am looking for AACTACGTCCATCG and only want "hits" with exact matches to AACTACGTCCATCG, even though it's within a longer sequence, i.e. GCACTACTGGTCA(AACTACGTCCATC)? I know that using grep will give you both AACTACGTCCATC and GCACTACTGGTCA(AACTACGTCCATC).
"not as part of a longer sequence" is exactly the opposite of "even though it's within a longer sequence". Maybe you should clarify that, although it looks like a simple grep would do.
If it is a "well-behaved" fastq file (no blank lines, no multi-line seqs, no such sequence in the quality string), the following also works very nicely to give you all matching fastq records, so no need to install any toolkit:
grep -B1 -A2 "^AACTACGTCCATC$" test.fq
And if it is a "very-well-behaved" one (no such sequence in any header), then a fixed-string word search would be even slightly faster ;)
grep -B1 -A2 -F -w AACTACGTCCATC test.fq
Edit: I just tested this on a very large fastq file, and the -w option is not at all faster than a proper start-end search. grep -B1 -A2 "^AACTACGTCCATC$" test.fq is definitely faster.
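For anyone who wants to try the start-end search on toy data first, here is a minimal sketch. The three reads and the temporary file are made up for illustration. One thing to watch: with -B/-A, grep prints a "--" line between non-adjacent groups of matches, which would break the fastq output, so the sketch filters it out (GNU grep also has --no-group-separator for this, if your grep supports it).

```shell
# Hypothetical toy fastq: two exact hits (read1, read3) and one longer read
# (read2) that merely contains the query.
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'AACTACGTCCATC' '+' 'IIIIIIIIIIIII' \
  '@read2' 'GCACTACTGGTCAAACTACGTCCATC' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIII' \
  '@read3' 'AACTACGTCCATC' '+' 'IIIIIIIIIIIII' > "$fq"

# Unanchored search: counts every line containing the query, including read2.
all=$(grep -c 'AACTACGTCCATC' "$fq")

# Anchored search: only lines that consist of exactly the query, plus the
# header before and the two lines after. The second grep drops the "--"
# group separators that grep inserts between non-adjacent records.
hits=$(grep -B1 -A2 '^AACTACGTCCATC$' "$fq" | grep -v '^--$')

echo "lines containing the query: $all"
printf '%s\n' "$hits"
rm -f "$fq"
```

The unanchored count includes the longer read, while the anchored output should contain only the read1 and read3 records.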
marie.lorans : Based on the clarification below, this is the result you wanted (though the others are useful in other circumstances). So you should accept Michael Dondrup's answer (green checkmark) to provide closure to this thread.
I think in this context "very-well-behaved" == "well-behaved". I thought about the headers as well, but fastq headers start with @, so that should be ok, but it could theoretically appear in the quality string.
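If the quality-string case is a worry, one way around it is to look only at the second line of every four-line record. This is just a sketch, assuming strict four-line fastq records; the toy reads, the temporary file, and the awk variable names are made up for illustration.

```shell
# Hypothetical toy fastq: read1 is an exact hit, read2 has the query only in
# its quality string, read3 contains the query inside a longer read.
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'AACTACGTCCATC' '+' 'IIIIIIIIIIIII' \
  '@read2' 'GGGGGGGGGGGGG' '+' 'AACTACGTCCATC' \
  '@read3' 'GCACTACTGGTCAAACTACGTCCATC' '+' 'IIIIIIIIIIIIIIIIIIIIIIIIII' > "$fq"

# Buffer each 4-line record and print it only if the sequence line (line 2
# of the record) equals the query exactly. Headers and quality strings are
# never compared against the query.
records=$(awk -v q='AACTACGTCCATC' '
  NR % 4 == 1 { header = $0 }
  NR % 4 == 2 { seq = $0 }
  NR % 4 == 3 { plus = $0 }
  NR % 4 == 0 { if (seq == q) print header "\n" seq "\n" plus "\n" $0 }
' "$fq")

printf '%s\n' "$records"
rm -f "$fq"
```

This will likely be slower than grep on very large files, but it does not depend on the file being "well-behaved" beyond having four lines per record.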
SeqKit grep should do that; see https://bioinf.shenwei.me/seqkit/usage/#grep
The command might look like this (tested, works ok, insert your sequence of choice between ^ and $):
Yeah, I know, but according to the question asked, "its within a longer sequence i.e GCACTACTGGTCA(AACTACGTCCATC)", if the string matches within a sequence, it will display only the matching portion.
OP's wording is quite ambiguous - it's not clear if they only want exact matches or if they want all matches with exact overlaps highlighted. The latter doesn't make much sense, but I guess if that were their question, your answer would be the best fit. Let's see if they clarify their requirement.