Looking For Reliable Tools To Do Quality Filtering Of Fastq Files
12.2 years ago
dataminer89 ▴ 50

I am looking for programs that allow one to pre-process and filter large fastq files for various quality measures.

I know of the FASTX-Toolkit, but it seems a little long in the tooth (released in 2009), and the documentation of what it actually does is lacking. Also, only one or two of its tools would be useful to me; the rest seem to be plotting helpers of some sort.

There are publications out there, such as the very recent "NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data" (PLoS One, 2012), but after reading it I am left scratching my head. This is a pure Perl QC tool developed to run on Windows, which means it has no internal core that could have been written in C to be fast. It makes me wonder how this even got accepted.

I need some recommendations for tools that have been tried in practice and proven to be fast and reliable. Ideally I would like to hear about the tool you use. Besides filtering by average quality, clipping, and trimming back reads, I would like to be able to detect various artifacts that the data might have, for example duplication, preferential enrichment of subsequences, polyadenylation, etc.
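
To make the checks listed above concrete, here is a minimal Python sketch of three of them: 3' quality trimming, poly-A tail detection, and exact-duplicate counting. It is only an illustration, not how any tool named in this thread works. The function names, the Phred+33 offset, and the quality cutoff of 20 are all assumptions chosen for the example.

```python
from collections import Counter

def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut bases off the 3' end until a base with quality >= min_q is reached.

    Assumes Phred+33 encoded quality strings (hypothetical default).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def polya_tail_length(seq):
    """Length of the uninterrupted run of A's at the 3' end of the read."""
    return len(seq) - len(seq.rstrip("A"))

def duplication_rate(seqs):
    """Fraction of reads that are exact copies of another read in the set."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    distinct = len(Counter(seqs))
    return 1 - distinct / len(seqs)

# Small usage example on made-up reads.
trimmed = trim_3prime("ACGTAC", "IIII##")
tail = polya_tail_length("ACGTAAAA")
dup = duplication_rate(["AC", "AC", "GT", "GT"])
print(trimmed, tail, dup)
```

Exact-match duplicate counting keeps every distinct sequence in memory, so for large files a real tool would need something more economical; this sketch ignores that.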

Thanks for any input!

fastq qc • 9.6k views

I don't agree with your comments about developing a tool that will run on Windows. Writing software that is portable is a good thing. One of the strengths of Perl (or any scripting language) is the relative ease with which you can perform complex tasks like plotting, creating webpages, etc. and have them run on almost any OS. If you haven't found a program written in C that does everything you mention, there is probably a reason.


I think the rationale is that parsing and evaluating the FASTQ format is a surprisingly time-consuming operation in interpreted languages, because every quality character has to be decoded. In addition, many of the trimming algorithms require various types of inner loops, which are again a weakness of these languages. All in all, this makes such tools less appropriate for anyone who has large or numerous FASTQ files. Heng Li has posted a nice benchmark in this thread: "How to efficiently parse a huge fastq file?"
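
As a rough illustration of the per-character work described above, here is a minimal Python sketch of FASTQ parsing and quality decoding. It assumes strict four-line records and Phred+33 encoding (Sanger / Illumina 1.8+); older Illumina files use a different offset. The function names and the mean-quality cutoff of 20 are hypothetical, picked just for the example.

```python
import io

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual

def decode_quality(qual, offset=PHRED_OFFSET):
    """Turn each quality character into its integer Phred score."""
    return [ord(c) - offset for c in qual]

def mean_quality(qual):
    """Average Phred score of a read; 0.0 for an empty quality string."""
    scores = decode_quality(qual)
    return sum(scores) / len(scores) if scores else 0.0

# Two made-up reads: one high quality ('I' = 40), one very low ('!' = 0).
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
kept = [name for name, seq, qual in read_fastq(example) if mean_quality(qual) >= 20]
print(kept)
```

The `ord(c) - offset` step runs once per base, which is exactly the kind of inner loop that is cheap in C and comparatively slow in an interpreter.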


I agree completely with you about parsing, and I understand the argument. For a lot of tasks, I'll write things in C, but my understanding is that the OP wanted a universal tool to do trimming, plotting, etc. in C, and I just haven't seen one. Frankly, I haven't found a tool written in C that actually works even for trimming. They either use way too much memory or, in the case of seqtk, don't actually work. I used seqtk for trimming recently and it was fast, but under default settings it removed no reads and left a lot of reads that were almost all Ns.
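
For the "almost all Ns" problem, a post-filter on the fraction of ambiguous bases is easy to sketch. This is a minimal Python illustration, not a description of what seqtk or any other tool does internally; the function names and the 10% threshold are assumptions.

```python
def n_fraction(seq):
    """Fraction of bases in the read that are N; an empty read counts as all N."""
    return seq.upper().count("N") / len(seq) if seq else 1.0

def drop_n_rich(reads, max_n_fraction=0.1):
    """Keep only (name, seq, qual) records whose N fraction is within the limit."""
    return [
        (name, seq, qual)
        for name, seq, qual in reads
        if n_fraction(seq) <= max_n_fraction
    ]

# Made-up records: r2 is mostly N and should be dropped.
reads = [("r1", "ACGT", "IIII"), ("r2", "NNNA", "####")]
clean = drop_n_rich(reads)
print([name for name, _, _ in clean])
```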

12.2 years ago
SES 8.6k

I find that PRINSEQ does everything I want, and it will do all the things you listed in your post. It is written in Perl, and while it would be cool to find something written in C that produces results of this quality, I don't know if it would be as portable, as easy to use, or worth the time to develop. But I'd like to know if you find such a tool!


The site, manual, and all content look very professional. It greatly surprises me that I have never heard of it before.


They write very nice software, including tagcleaner (http://tagcleaner.sourceforge.net/), and they are professional and responsive to their users, in my experience.

12.2 years ago

Try Biopieces (www.biopieces.org). There is a section on cleaning NGS data in the HowTo. It is simple to set up workflows, and with GNU Parallel you can easily distribute the tasks to multiple servers.

12.2 years ago

I don't think there is a single tool that does everything one needs to QC and filter data for all datasets. However, FastQC is one that gives a quick overview in a readable format.

12.2 years ago

I still use the FASTX-Toolkit; I think it's fast and I have had no problems with it. I also recently tried Trimmomatic for a more complicated trimming situation, and I thought it worked nicely.


This is also something I've never seen before. Once this list gets longer, I will collect all the tools into a tutorial with a bake-off type contest.
