Question

Looking For Reliable Tools To Do Quality Filtering Of Fastq Files

5

12.2 years ago

dataminer89 ▴ 50

I am looking for programs that allow one to pre-process and filter large fastq files for various quality measures.

I know of the fastx toolkit but it seems a little long in the tooth (released in 2009) and the documentation of what it actually does seems to be lacking. Plus there are only one or two tools that would be useful for me, the rest seem to be some sort of plotting helpers.

There are publications out there, such as the very recent "NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data" (PLoS One, 2012), but after reading it I am left scratching my head. This is a pure Perl QC tool developed to run on Windows, which means it has no internal core written in C to make it fast. It makes me wonder how this even got accepted.

I need some recommendations of tools that have been tried in practice and proven to be fast and reliable. Ideally I would like to hear about the tool you use. Besides filtering by average quality, clipping, and trimming back reads, I would like to be able to detect various artifacts that the data might have, for example duplication, preferential enrichment of subsequences, polyadenylation, etc.
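To make two of these artifact checks concrete, here is a minimal Python sketch, not taken from any of the tools discussed in this thread. The function names `duplication_rate` and `trim_poly_a` and the `min_run` threshold are hypothetical choices for illustration. It only handles exact duplicates and a trailing run of `A`, which is far simpler than what a real QC tool would do.

```python
from collections import Counter


def duplication_rate(sequences):
    """Return the fraction of reads that are exact copies of another read.

    Assumption: only exact, full-length duplicates are counted.
    All distinct sequences are held in memory, so this is a sketch for
    small inputs rather than a solution for very large FASTQ files.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


def trim_poly_a(seq, min_run=10):
    """Remove a trailing run of 'A' if it is at least min_run bases long.

    min_run is a hypothetical threshold; shorter runs are left alone
    because they may be genuine sequence.
    """
    stripped = seq.rstrip("A")
    if len(seq) - len(stripped) >= min_run:
        return stripped
    return seq
```

For example, `duplication_rate(["ACGT", "ACGT", "TTTT", "GGGG"])` counts one of four reads as redundant, and `trim_poly_a` leaves a read ending in only three `A` bases unchanged.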

Thanks for any input!

fastq qc • 9.6k views

ADD COMMENT • link updated 12.2 years ago by Madelaine Gogol 5.3k • written 12.2 years ago by dataminer89 ▴ 50

0

I don't agree with your comments about developing a tool that will run on Windows; writing software that is portable is a good thing. One of the strengths of Perl, for example (or any scripting language), is the relative ease with which you can perform complex tasks like plotting, creating webpages, etc. and have them run on almost any OS. If you haven't found a program written in C that does everything you mention, there is probably a reason.

ADD REPLY • link 12.2 years ago by SES 8.6k

0

I think the rationale is that parsing and evaluating the FASTQ format is a surprisingly time-consuming operation in interpreted languages, because of the operations needed to decode each quality character. In addition, many of the trimming algorithms require various types of inner loops, which are again a weakness of these languages. All in all, this makes such tools less appropriate for anyone who has large or numerous FASTQ files. Heng Li has posted a nice benchmark in this thread: "How to efficiently parse a huge fastq file?"
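As a rough illustration of the per-character work involved, here is a minimal Python sketch of reading FASTQ records and filtering by mean quality. It assumes the common 4-line record layout and Phred+33 encoding; the function names and the `min_mean` threshold are hypothetical and not taken from any tool mentioned here.

```python
def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ text stream.

    Assumption: every record is exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def mean_quality(qual, offset=33):
    """Decode each quality character (Phred+offset) and average the scores."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_by_mean_quality(records, min_mean=20.0):
    """Keep only records whose mean Phred score is at least min_mean."""
    for name, seq, qual in records:
        if mean_quality(qual) >= min_mean:
            yield name, seq, qual
```

The `ord(c) - offset` step runs once per base in every read, which is the kind of inner loop that tends to dominate the runtime in an interpreted language.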

ADD REPLY • link 12.2 years ago by Istvan Albert 102k

0

I agree completely with you about parsing, and I understand the argument. For a lot of tasks, I'll write things in C, but my understanding is that the OP wanted a universal tool to do trimming, plotting, etc. in C, and I just haven't seen it. Frankly, I haven't found a tool written in C that actually works even for trimming. They either use way too much memory or, in the case of seqtk, don't actually work. I used seqtk for trimming recently and it was fast, but under default settings it removed no reads and left a lot of reads consisting almost entirely of Ns.
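For the N-rich reads specifically, a filter on the fraction of `N` bases is simple to express. This is a minimal Python sketch with hypothetical names and a hypothetical `max_n_fraction` threshold; it does not reproduce what seqtk or any other tool does.

```python
def n_fraction(seq):
    """Return the fraction of bases in seq that are 'N' (case-insensitive).

    An empty read is treated as entirely uninformative.
    """
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def drop_n_rich(records, max_n_fraction=0.1):
    """Yield (name, sequence, quality) records with few enough N bases.

    max_n_fraction is a hypothetical cutoff chosen for illustration.
    """
    for name, seq, qual in records:
        if n_fraction(seq) <= max_n_fraction:
            yield name, seq, qual
```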

ADD REPLY • link 12.2 years ago by SES 8.6k

score 3 · Answer 1 · 2012-09-08

3

12.2 years ago

SES 8.6k

I find that PRINSEQ does everything I want, and it will do all the things you listed in your post. It is written in Perl, and while it would be cool to find something written in C that produces results of this quality, I don't know if it would be as portable, as easy to use, or worth the time to develop. But I'd like to know if you find such a tool!

ADD COMMENT • link 12.2 years ago by SES 8.6k

1

The site, manual, and all the content look very professional. It greatly surprises me that I have never heard of it before.

ADD REPLY • link 12.2 years ago by Istvan Albert 102k

1

They write very nice software, including tagcleaner: http://tagcleaner.sourceforge.net/

and they are professional and responsive to their users, in my experience.

ADD REPLY • link 12.2 years ago by SES 8.6k

score 2 · Answer 2 · 2012-09-08

2

12.2 years ago

Martin A Hansen 3.0k

Try Biopieces (www.biopieces.org). There is a section on cleaning NGS data in the HowTo. It is simple to set up workflows, and with GNU Parallel you can easily distribute the tasks to multiple servers.

ADD COMMENT • link 12.2 years ago by Martin A Hansen 3.0k

score 0 · Answer 3 · 2012-09-08

0

12.2 years ago

Sean Davis 27k

I don't think there is a single tool that does all that one needs to QC and filter data for all datasets. However, fastqc is one that does give a quick overview in a readable format.

ADD COMMENT • link 12.2 years ago by Sean Davis 27k

score 0 · Answer 4 · 2012-09-10

0

12.2 years ago

Madelaine Gogol 5.3k

I still use the fastx toolkit; I think it's fast and I have no problems with it. I also recently tried trimmomatic for a more complicated trimming situation and I thought it worked nicely.

ADD COMMENT • link 12.2 years ago by Madelaine Gogol 5.3k

0

This is also something I've never seen before. Once this list gets longer, I will collect all the tools into a tutorial with a bake-off type contest.

ADD REPLY • link 12.2 years ago by Istvan Albert 102k