I would use Python (no dependencies):
1. Read your read names into a list and convert it to a set (membership tests are hash-based, so checking whether an element is present is much faster than in a list).
2. Parse your FASTQ file and check whether each name is present in the set of read names of interest (script below).
#!/usr/bin/python
import sys

# get the ID file name from the first parameter
idfile = sys.argv[1]
# load IDs into a set for fast (hash-based) membership tests
ids = set(x.strip() for x in open(idfile))
# read the FASTQ from stdin
handle = sys.stdin
while ids:
    # parse one FASTQ record (4 lines)
    idline = handle.readline()
    if not idline:
        break  # end of input
    seq    = handle.readline()
    spacer = handle.readline()
    quals  = handle.readline()
    # check whether this read is one we want
    id = idline[:-1]
    if id in ids:
        # print the FASTQ record
        sys.stdout.write('%s%s%s%s' % (idline, seq, spacer, quals))
        # drop the found ID so the loop can stop once all IDs are found
        ids.remove(id)
My Python implementation is over 3x faster than the C++ one by Pierre. The gain comes from using a set (hash-based lookup) instead of a plain list.
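To see why this matters, here is a minimal, self-contained sketch (the sizes and names are made up for illustration) comparing membership tests on a list versus a set:

import timeit

ids_list = [str(i) for i in range(10000)]
ids_set = set(ids_list)

# looking for an absent element: the list is scanned end to end (O(n)),
# while the set does a single hash lookup (O(1) on average)
print(timeit.timeit(lambda: 'missing' in ids_list, number=1000))
print(timeit.timeit(lambda: 'missing' in ids_set, number=1000))

The gap grows linearly with the number of IDs, which is why the list-based run below is so much slower.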
#c++
time cat PL429_q10_filtered_1.fastq | ./parse PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqcpp
real 4m21.137s
user 4m15.950s
sys 0m10.510s
#python (3.4x faster than c++)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.fqSet
real 1m11.155s
user 1m5.270s
sys 0m10.920s
#python list instead of set (56x slower than python using set!)
time cat PL429_q10_filtered_1.fastq | python fastq_id_parser.noBioPython.noSet.py PL429_q10_filtered_1.fastq.lids > PL429_q10_filtered_1.fastq.lids.fqNoSet
real 70m54.534s
user 70m18.650s
sys 0m16.740s
#baseline: just streaming the file (wc -l)
time cat PL429_q10_filtered_1.fastq | wc -l
115959528
real 0m30.373s
user 0m1.990s
sys 0m8.250s
My test set was 10k IDs: the first and last 5k headers of the FASTQ.
The FASTQ was 4.9 GB, 115,959,528 lines, with sequences between 31 and 75 bases. I used a 15k RPM SAS drive: read speed ~200 MB/s, access time ~5 ms.
Same as before... Python :) My 2 cents ;) BTW: check how many times longer the Perl implementation takes than wc -l; the Python implementation takes ~2x longer, C++ ~8x longer, and Python using a list ~140x longer. So for 70 million reads the differences can amount to hours :P
You can also have a look at this post: NGS - Huge (FASTQ) file parsing - which language for good efficiency?
Thanks to all of you guys for the answers and for your time! I think I'll go for the Perl-based solution (since I don't know Python or C++). Generally speaking, if instead of having just 10K reads to look for I had 100K or millions of them, what would be the best solution?
I have a visual app (Efficiently process (view, analyze, clip ends, convert, demultiplex, dereplicate)) that shows all reads in a FastQ file. It takes about 100 seconds to open a 0.5 GB file (3.3 million reads) on an old Toshiba laptop (but when you open the file the next time, it takes under 1 second). I would like to test it on a 19 GB file, but I don't have such a file.
So, is this fast or slow? I couldn't find any real numbers on this forum to compare mine against.
Waiting 100 seconds in a graphical user interface just to open a file is too long. A common mistake people make is thinking that one needs to process the entire file to characterize it.
Read in 100k lines and report on that, then continue the loading in the background and update the results (a sketch of this pattern follows).
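A minimal sketch of that idea, assuming a plain FASTQ on disk (the function names and the callback are hypothetical, not the app's actual API):

import sys
import threading

def count_reads(path, max_lines=None):
    # count FASTQ records (4 lines each), optionally stopping early
    lines = 0
    with open(path) as handle:
        for line in handle:
            lines += 1
            if max_lines is not None and lines >= max_lines:
                break
    return lines // 4

def open_with_preview(path, on_update):
    # report quickly on the first 100k lines only...
    on_update('preview', count_reads(path, max_lines=100000))
    # ...then scan the whole file in a background thread and update again
    worker = threading.Thread(target=lambda: on_update('final', count_reads(path)))
    worker.daemon = True
    worker.start()
    return worker

# usage:
# open_with_preview('reads.fastq',
#                   lambda tag, n: sys.stdout.write('%s: %d reads\n' % (tag, n)))

The same pattern applies to any per-read statistics: compute them on the preview slice first, then refresh once the background pass finishes.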
You need 100 seconds only when you open the file for the first time. After that it is less than 1 sec. Plus, no optimizations have been applied yet; I hope I can cut the time in half.
Anyway, your idea is great. I will do that.
The time is now, after some optimizations, 23.6 seconds (for the same file). Can I post this app here (in a new thread) so I can get feedback? Are there any BioStars rules against this?
Sure, just make it a Forum post.
Thanks Albert. I posted it here: Efficiently parsing and viewing a large fastq file
You can always concatenate files downloaded from SRA or generate one with random sequences/qualities to obtain a file of any size.
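For the second option, a minimal sketch (the file name, read length, and ID scheme are made up for illustration):

import random

def random_fastq(path, n_reads, length=75):
    # write n_reads random FASTQ records with random Phred+33 qualities
    bases = 'ACGT'
    with open(path, 'w') as out:
        for i in range(n_reads):
            seq = ''.join(random.choice(bases) for _ in range(length))
            quals = ''.join(chr(random.randint(33, 73)) for _ in range(length))
            out.write('@read_%d\n%s\n+\n%s\n' % (i, seq, quals))

# each 75-base record is roughly 165 bytes, so ~19 GB would need on the
# order of 120 million reads
random_fastq('random.fastq', 1000)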