I'm looking for a practical option for filtering a large fasta or fastq file. I'm dealing with a MiSeq read set where there was a problem with our Index reads, but we can separate the sequences by primer type. So far I have tried this approach:
- use fuzzy matching (agrep, allowing 2 mismatches) to query the entire sequence set, and save the matching sequences to a file
- figure out which read names correspond to the matching sequences from step 1 (grep -f)
- filter the original fastq with that list of read names (rough sketch below)
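For reference, here is roughly what that pipeline looks like as bash. The primer sequence and file names are placeholders, and the last step assumes seqtk is available (any tool that can subset a fastq by read name would do):

```bash
# 1) fuzzy-match the primer against the reads, allowing up to 2 errors
agrep -2 "ACGTACGTACGT" reads.fasta > matching_seqs.txt

# 2) recover the read names: grab the header line just above each matching
#    sequence line (grep -f is what makes this step painfully slow)
grep -B1 -f matching_seqs.txt reads.fasta | grep '^>' | sed 's/^>//' > read_names.txt

# 3) subset the original fastq by those read names
seqtk subseq reads.fastq read_names.txt > filtered.fastq
```

Step 2 is the bottleneck: grep -f against a huge file of long literal patterns is what blows the runtime up.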
This works fine for subsets of the sequencing run, but if I want to do the entire run, it will take ages (many days, or perhaps weeks, depending on the size of the sequence set). This isn't practical.
I'm looking for an existing tool (or even a series of bash commands) that takes a query sequence (my primer) and filters an entire fasta or fastq based on a fuzzy match, with a configurable number of allowed mismatches. It should handle a full MiSeq run's worth of reads (in my case, 5-10 GB in fastq format). It doesn't strictly need to support fastq: fasta-only output would be fine, since I can always filter my fastq files afterwards using the read names from that output.
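To make that concrete, here is a minimal sketch of the kind of one-pass bash filter I'm imagining, again with placeholder primer and file names. It assumes agrep (or tre-agrep) copes with long tab-separated lines, and it ignores the small chance of the primer fuzzily matching the name or quality fields instead of the sequence:

```bash
# flatten each 4-line fastq record onto one tab-separated line, fuzzy-match
# the primer with up to 2 errors, then keep only the read-name field
paste - - - - < reads.fastq \
    | agrep -2 "ACGTACGTACGT" \
    | cut -f1 \
    | sed 's/^@//' > read_names.txt
```

In principle that is a single pass over the data, but I don't know whether agrep will scale to a full run, which is why I'm asking.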
It seems like something of this nature should already exist, but I'm having trouble finding anything that works at the dataset sizes I'm dealing with. The closest I've found is this fuzzy DNA search page: http://www.bioinformatics.org/sms2/fuzzy_search_dna.html It works on small sequence sets, but after I downloaded it and modified the scripts and HTML files to accept my inputs, it just crashes. I suspect the HTML frontend is the problem, and I don't have the expertise to modify the .js files beyond tweaking parameters. I'm more comfortable with awk, sed, and other bash commands, plus perl and python.
Thanks in advance for any tips/answers!
fuzznuc from the EMBOSS suite does this kind of approximate nucleotide pattern matching, with a user-set mismatch allowance: http://emboss.sourceforge.net/apps/cvs/emboss/apps/fuzznuc.html
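If I'm reading the docs right, the call would be something like this (primer and file names are placeholders; check the linked page for the exact options in your EMBOSS version):

```bash
# report every read containing the primer with at most 2 mismatches;
# -complement Y should also search the reverse strand, if your version supports it
fuzznuc -sequence reads.fasta -pattern ACGTACGTACGT -pmismatch 2 \
        -outfile primer_hits.fuzznuc
```

The report includes the names of the matching reads, so you can feed those straight back into your fastq filtering step.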