Filter a large fasta or fastq by a query sequence, with parameterized fuzzy matching
6.1 years ago
ovon ▴ 20

I'm looking for a practical option for filtering a large fasta or fastq file. I'm dealing with a MiSeq read set where there was a problem with our Index reads, but we can separate the sequences by primer type. So far I have tried this approach:

  1. Use fuzzy matching (agrep) to query the entire sequence set, allowing 2 mismatches, and save the matching sequences to a file.
  2. Work out which read names correspond to the matching sequences identified in step 1 (grep -f).
  3. Filter the original fastq with that list of read names.

This works fine for subsets of the sequencing run, but if I want to do the entire run, it will take ages (many days, or perhaps weeks, depending on the size of the sequence set). This isn't practical.

I'm looking for an existing tool (or even a series of bash commands) that can take a query sequence (my primer) and filter the entire fasta or fastq based on a fuzzy match, where I can set the number of allowed mismatches. It should be able to handle a full MiSeq run's worth of reads (in my case, 5-10GB in fastq format). Fastq support isn't essential: if the tool only produces fasta output, I can still filter my fastq files using the read names from that output.
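For illustration, the requirement above could be sketched as a single streaming pass in Python, which avoids the separate grep -f lookup on read names. This is a minimal sketch under stated assumptions, not a tuned tool: it assumes plain 4-line fastq records, counts substitutions only (Hamming distance, so no indels, which differs from agrep's default behaviour), and the function names and the `search_window` parameter are hypothetical. Pure Python may still be slow on a 5-10GB file.

```python
import gzip
from itertools import islice


def hamming_within(a, b, max_mismatch):
    """True if equal-length strings a and b differ at no more than max_mismatch positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatch:
                return False  # stop early once the budget is exceeded
    return True


def has_fuzzy_match(seq, primer, max_mismatch=2, search_window=None):
    """Slide the primer along seq; substitutions only, no indels.

    search_window (hypothetical parameter) limits the search to the first
    N bases, which helps if the primer is expected near the read start.
    """
    region = seq if search_window is None else seq[:search_window]
    k = len(primer)
    for i in range(len(region) - k + 1):
        if hamming_within(region[i:i + k], primer, max_mismatch):
            return True
    return False


def filter_fastq(in_path, out_path, primer, max_mismatch=2, search_window=None):
    """Copy matching records from in_path to out_path; return the number kept.

    Assumes each record is exactly 4 lines (header, sequence, '+', quality).
    """
    opener = gzip.open if in_path.endswith(".gz") else open
    primer = primer.upper()
    kept = 0
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        while True:
            record = list(islice(fin, 4))
            if len(record) < 4:
                break  # end of file (or a truncated trailing record)
            if has_fuzzy_match(record[1].strip().upper(), primer,
                               max_mismatch, search_window):
                fout.writelines(record)
                kept += 1
    return kept
```

A call might look like `filter_fastq("run.fastq.gz", "primerA.fastq", "GTGCCAGCMGCCGCGGTAA", max_mismatch=2, search_window=40)`, where the file names and primer are placeholders. Note that degenerate bases such as `M` are compared literally in this sketch.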

It seems like something of this nature would exist already, but I'm having trouble finding anything that works for the dataset sizes I'm dealing with. The fuzzy search tool at http://www.bioinformatics.org/sms2/fuzzy_search_dna.html works on small sequence sets. I downloaded it and modified the scripts and HTML files so it could handle my inputs, but it now just crashes; I suspect the HTML frontend is the issue. I don't have the expertise to modify the .js files beyond changing parameters. I am more familiar with awk, sed, and other bash commands, perl, and python.

Thanks in advance for any tips/answers!

fasta fastq filtering • 4.5k views
6.1 years ago

Hello,

Have a look at seqkit grep.
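As a rough sketch of how such a call might be driven from Python: the helper names below are hypothetical, the paths and primer are placeholders, and it assumes the `seqkit` binary is on the PATH. The flags used (`--by-seq`, `--max-mismatch`, `--pattern`, `--out-file`) are seqkit grep's sequence-matching options; check `seqkit grep --help` for your installed version before relying on them.

```python
import subprocess


def build_seqkit_grep_cmd(primer, in_path, out_path, max_mismatch=2):
    """Build the argument list for a mismatch-tolerant seqkit grep call.

    Hypothetical helper; it only assembles the command and does not run it.
    """
    return [
        "seqkit", "grep",
        "--by-seq",                          # match against the sequence, not the read name
        "--max-mismatch", str(max_mismatch), # number of allowed mismatches
        "--pattern", primer,                 # the query sequence (primer)
        "--out-file", out_path,
        in_path,
    ]


def run_seqkit_grep(primer, in_path, out_path, max_mismatch=2):
    """Run seqkit grep; raises CalledProcessError if seqkit exits non-zero."""
    cmd = build_seqkit_grep_cmd(primer, in_path, out_path, max_mismatch)
    subprocess.run(cmd, check=True)
```

For example, `run_seqkit_grep("ACGTACGTACGT", "run.fastq.gz", "primerA.fastq.gz")` would correspond to running `seqkit grep --by-seq --max-mismatch 2 --pattern ACGTACGTACGT --out-file primerA.fastq.gz run.fastq.gz` in a shell.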

fin swimmer


Thank you very much, I tested this just now and it works very well. Much faster than my agrep-based method. A very useful tool to know about.
