Fast Qseq To Fastq Converter
13.7 years ago
Darked89 4.7k

I got qseq2fastq.pl from Galaxy tools, but I am looking for something that can do the job faster. So far I have found one C++ program from Keio U., but I am still trying to make it work (with a lot of help from the program's author). Are there any other such tools out there that you have managed to use?

EDIT 1 Forgot about python program written by Brad Chapman repo:

EDIT 2 At least for the conversion from Illumina to Sanger quality scores, there is a difference in speed when converting a single file. This was a crude benchmark (remote disk with multiple users, wall-clock "real" times), but the difference is clear:

C++ (by Kris Popendorf): run 3x == 2m30s

java (by Pablo, see below) run 2x == 6m45s

perl (Galaxy tools, perl, v5.10.1) run 3x == 6m50s

The input file contained bad-quality or PhiX reads (no index tag): 6634666 lines of 96 bp.

There are also differences in output due to the naming convention:

C++
@QSEQ80.1 HWI-ST227:5:1:1140:1008#..../1 PF=0 length=96
+QSEQ80.1 HWI-ST227:5:1:1140:1008#..../1 PF=0 length=96

java
@seq_10
+

perl (I modified the script to output just + )
@5:1:1140:1008:N
+
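
The naming differences above come down to how each tool builds the FASTQ header from the qseq columns. Here is a minimal Python sketch of the per-line conversion, assuming the usual 11 tab-separated qseq columns (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag). The header layout and the function name `qseq_line_to_fastq` are illustrative choices, not what any of the three programs above emits, and the quality string is copied through unchanged.

```python
def qseq_line_to_fastq(line: str) -> str:
    """Convert one tab-separated qseq line into a four-line FASTQ record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 qseq fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read, seq, qual, _pf = fields
    # qseq marks uncalled bases with '.', FASTQ uses 'N'.
    seq = seq.replace(".", "N")
    # Assumed header layout: machine_run:lane:tile:x:y#index/read
    header = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read}"
    # The quality string is passed through as-is (no re-encoding here).
    return f"@{header}\n{seq}\n+\n{qual}\n"
```

Writing just `+` on the third line, as the java and perl outputs do, keeps the output smaller than repeating the full header there.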

EDIT 3 fixed broken link to Keio U. C++ program.

fastq next-gen sequencing • 9.0k views

you might want to retitle this as a programming challenge or code golf


gist.github.com/549824 DOES NOT convert the quality score!!!! Right? That is the most time consuming bit and the most important as it is difficult to spot!


See related thread on seqanswers: http://seqanswers.com/forums/showthread.php?t=1801


@Stefano, correct, it doesn't change the quality encoding. that's partly why i didn't post it as an answer.

13.7 years ago
Pablo ★ 1.9k

Although I agree with Brad that this is mostly an I/O-bound process, a script can be slower than your I/O. Furthermore, you can sometimes benefit from running several instances in parallel (when you have RAID5 or similar).

Here is a java program I made, which (IMO) is faster than the equivalent perl/python script.

http://sourceforge.net/projects/snpeff/files/qseq2fastq.jar/download

To use it: cat in.qseq | java -jar qseq2fastq.jar -phred64 > out.fastq

And, as Daniel said, it can be parallelised trivially.


Hi Pablo, I was also interested in perhaps using the program you've made, unfortunately, it is completely missing from the link you provided.

13.7 years ago

Author of Keio U.'s qseq2fastq here. Jeremy Leipzig is correct that this makes a nice "programming golf" problem in that we have a bucket full of working solutions, and it's just an exercise in optimization to make it go faster. To that end, my approach does a lot better than perl and (according to OP) java. Given that generating real qseq files from an Illumina GA takes waaay longer than any format conversion tool, it might seem strange to put so much effort into this step (why not just wait another hour or two? It's a loong coffee break). But in some situations, like with simulated data, or lots of different data that we maybe didn't generate ourselves, it's worthwhile to make it go faster (I was finding myself in both of these categories, so I put some effort into making this).

We obviously can't go faster than the IO lets us (which can actually be quite fast with good caching, SSDs, or maybe even ramdisks if we're working with simulated data), so the goal is to waste as little time doing non-IO stuff as possible. If you're careful, C++ can do a pretty good job at that. In my testing, perl at least goes way slower than the limit of our disks, so there's room for a lot of improvement over perl at least.
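
One rough way to see how much time goes to "non-IO stuff" is to time a bare read of the file against a read that also does per-line work. The sketch below is only an illustration in Python, not part of the C++ tool; `timed_pass` and `split_fields` are hypothetical names, and the bare pass still pays for line splitting, so it only approximates the I/O floor.

```python
import time


def timed_pass(path, per_line=None, bufsize=1 << 20):
    """Read `path` once and return (seconds, bytes_read).

    With per_line=None only the buffered read is timed; pass a function
    to include per-line conversion work in the measurement.
    """
    start = time.perf_counter()
    n_bytes = 0
    with open(path, "rb", buffering=bufsize) as handle:
        for line in handle:
            n_bytes += len(line)
            if per_line is not None:
                per_line(line)
    return time.perf_counter() - start, n_bytes


def split_fields(line):
    # Stand-in for real conversion work: split the tab-separated qseq columns.
    return line.rstrip(b"\n").split(b"\t")
```

Comparing `timed_pass(path)` with `timed_pass(path, split_fields)` on the same file gives a feel for the overhead. Run each more than once, since the second read usually comes from the page cache.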

And yes, it is trivially parallelizable. When dealing with Illumina-generated qseq files, you generally have a huge directory of files, 1 per tile (there can be 120+ tiles) per lane per paired-end read (about 1680 total for a full PE run), which most of the time we'd like to merge into at most 1 or 2 files per lane. I run 1 thread per output file (theoretically, if you had a magically fast way of merging files, you might run 1 thread per input file, but that seems a little crazy). For applications like Velvet, I find this level of parallelism very handy, because we can do things like filtering on paired ends considered as pairs while converting (i.e. get a "velvet shuffled file" (paired ends muxed) where both pair members pass an arbitrary filter).
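
The pair-aware filtering can be illustrated with a short Python sketch (this is not the C++ implementation). It assumes the read 1 and read 2 qseq files list clusters in the same order, and that column 11 is the filter flag with "1" meaning passed; `interleave_passing_pairs`, `_to_fastq` and the header layout are hypothetical.

```python
def _to_fastq(fields):
    """Format 11 qseq columns as one FASTQ record (assumed header layout)."""
    machine, run, lane, tile, x, y, index, read, seq, qual, _pf = fields
    header = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read}"
    return f"@{header}\n{seq.replace('.', 'N')}\n+\n{qual}\n"


def interleave_passing_pairs(read1_lines, read2_lines,
                             passes=lambda f: f[10] == "1"):
    """Yield mate 1 then mate 2 for every pair where both mates pass."""
    for line1, line2 in zip(read1_lines, read2_lines):
        f1 = line1.rstrip("\n").split("\t")
        f2 = line2.rstrip("\n").split("\t")
        # The pair is kept or dropped as a unit, so the output stays in sync.
        if passes(f1) and passes(f2):
            yield _to_fastq(f1)
            yield _to_fastq(f2)
```

Any other predicate on the 11 columns can be supplied through `passes`, for instance a minimum-quality check.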

I'm sure it could go even faster with some more tweaking, but for now I'm happy with it. And it's readable enough that it's fairly easy to add custom filters or wrap into some other code.


It works as expected and it is as fast as before.


Hi Darked89, I am having difficulty compiling the converter with scons. I am new to bioinformatics and would greatly appreciate it if you could list the commands you used to:

1) install scons in the designated environment, and

2) subsequently compile qseq2fastq.pl

Also, does this converter work on qseq.txt files only? Reason I ask is because my format is qseq not qseq.txt. Thank you in advance!


After installing scons, simply type scons in the qseq2fastq directory to build the executable qseq2fastq. Then run qseq2fastq -i seq_fold/* to get the results.

13.7 years ago
User 59 13k

I have to say I have been toying with this idea for a while (but of course doing nothing about it).

This is obviously a task that can be parallelised trivially (I get more than 1 qseq file per experiment, so I assume you all do too!)

Any conversion program could be fed to GNU Parallel - all the qseq->fastq converters I've seen are single threaded.

You could do worse than splitting the load across a few cores and concatenating the results sensibly afterwards.

Sorry, no proof of principle, just thoughts :)
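
As a rough illustration of the idea (an untested-at-scale sketch, not a benchmark), the split-and-concatenate approach could look like this in Python with one worker process per core. The converter is a bare-bones stand-in that leaves qualities untouched; `convert_file`, `concatenate` and the header layout are assumptions for the example.

```python
import shutil
import sys
from multiprocessing import Pool
from pathlib import Path


def convert_file(qseq_path):
    """Convert one qseq file to FASTQ next to it; return the output path."""
    out_path = str(qseq_path) + ".fastq"
    with open(qseq_path) as src, open(out_path, "w") as dst:
        for line in src:
            f = line.rstrip("\n").split("\t")
            header = f"{f[0]}_{f[1]}:{f[2]}:{f[3]}:{f[4]}:{f[5]}#{f[6]}/{f[7]}"
            dst.write(f"@{header}\n{f[8].replace('.', 'N')}\n+\n{f[9]}\n")
    return out_path


def concatenate(part_paths, merged_path):
    """Append the parts to merged_path in the given order, then delete them."""
    with open(merged_path, "wb") as merged:
        for part in part_paths:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)
            Path(part).unlink()


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python convert_parallel.py merged.fastq *_qseq.txt
    merged_path, inputs = sys.argv[1], sorted(sys.argv[2:])
    with Pool() as pool:  # defaults to one worker per core
        parts = pool.map(convert_file, inputs)  # results keep input order
    concatenate(parts, merged_path)
```

Whether this helps depends on the disk: if the conversion is already I/O bound, extra workers mostly compete for the same reads and writes.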


Daniel, parallel processes might not help much. My experience is that this is all IO bound; we run the Python script Darek mentions on the sequencing dump machine, and speed is entirely dependent on how busy the disk is.


I did wonder actually if that might be the case. I don't have sufficient amounts of data that the conversion time is rate limiting. Plenty to do whilst it happens, including the traditional cup of coffee..
