I am looking for programs that allow one to pre-process and filter large fastq files for various quality measures.
I know of the fastx toolkit, but it seems a little long in the tooth (released in 2009) and the documentation of what it actually does seems to be lacking. Plus, only one or two of its tools would be useful for me; the rest seem to be some sort of plotting helpers.
There are publications out there, such as the very recent "NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data" (PLoS One, 2012), but after reading it I am left scratching my head. This is a pure Perl QC tool developed to run on Windows, which means it has no internal core written in C to make it fast. It makes me wonder how this even got accepted.
I need some recommendations of tools that have been tried in practice and have proven to be fast and reliable. Ideally I would like to hear about the tool you use. Besides filtering by average quality, clipping, and trimming back reads, I would like to be able to detect various artifacts that the data might have, for example duplication, preferential enrichment of subsequences, polyadenylation, etc.
Thanks for any input!
I don't agree with your comments about developing a tool that will run on Windows; writing software that is portable is a good thing. One of the strengths of Perl, for example (or any scripting language), is the relative ease with which you can perform complex tasks like plotting, creating webpages, etc. and have them run on almost any OS. If you haven't found a program written in C that does everything you mention, there is probably a reason.
I think the rationale is that parsing and evaluating the fastq format is a surprisingly time-consuming operation in interpreted languages, because of the work needed to decode each quality character. In addition, many of the trimming algorithms require various types of inner loops, which are again a weakness of these languages. Altogether, this makes such tools less appropriate for anyone who has large or numerous fastq files. Heng Li has posted a nice benchmark in the thread "How to efficiently parse a huge fastq file?"
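To make the per-character decoding cost concrete, here is a minimal Python sketch of what "evaluating" a read involves. It assumes Phred+33 encoded qualities and strict four-line records (no wrapped sequence lines); the function names `read_fastq` and `mean_quality` are made up for illustration and do not come from any of the tools mentioned in this thread.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) from a strict 4-line-per-record fastq stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual

def mean_quality(qual, offset=33):
    """Decode every quality character (assumed Phred+33) and average the scores."""
    if not qual:
        return 0.0
    # One ord() call and one subtraction per base: this is the inner loop
    # that becomes expensive in an interpreted language on large files.
    return sum(ord(c) - offset for c in qual) / len(qual)

# Tiny in-memory example with two reads.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nNNNN\n+\n!!!!\n")
for name, seq, qual in read_fastq(demo):
    print(name, seq, mean_quality(qual))
```

Every base costs at least one interpreter-level operation here, so run time grows with the total number of bases rather than the number of reads.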
I agree completely with you about parsing, and I understand the argument. For a lot of tasks, I'll write things in C, but my understanding is that the OP wanted a universal tool to do trimming, plotting, etc. in C, and I just haven't seen it. Frankly, I haven't found a tool written in C that actually works even for trimming. They either use way too much memory or, in the case of seqtk, don't actually work. I used seqtk for trimming recently: it was fast, but it removed no reads and left a lot of reads consisting almost entirely of Ns under default settings.
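For reference, the behaviour I would expect from a trimmer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Phred+33 qualities, a naive 3' tail trim (not the algorithm seqtk or any other tool in this thread uses), and hypothetical names and thresholds (`trim_tail`, `keep_read`, `min_q=20`, `min_len=30`, `max_n_frac=0.1`) chosen arbitrarily.

```python
def trim_tail(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def keep_read(seq, min_len=30, max_n_frac=0.1):
    """Decide whether a (trimmed) read is worth keeping."""
    if len(seq) < min_len:
        return False  # too short after trimming (also covers empty reads)
    # Discard reads made up mostly of ambiguous bases.
    return seq.upper().count("N") / len(seq) <= max_n_frac

# Usage: trim first, then drop reads that are too short or too N-rich.
seq, qual = trim_tail("ACGTAC", "IIII##")
print(seq, qual, keep_read(seq, min_len=4))
```

The point of separating the two steps is that trimming alone never removes a read; a second pass on length and N content is needed for that.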