4.7 years ago
zhangdengwei
Hi, all
I am merging two FASTQ files, each larger than 10 GB, into one file. I use cat to do this, but it takes a long time, more than 3 hours. Is there any way to speed it up?
BTW, my command is
cat read1.fq read2.fq > merged.fq
Thanks in advance!
That's as fast as you can go. If it's still too slow, that probably means the storage you're running on doesn't have sufficient I/O, which could either be because it is of low quality or because other processes (other users or yourself) are using up all the I/O.
Thanks. The cause may indeed be insufficient I/O: the other processes were using nearly all of it. I naively assumed that cat uses very little CPU and overlooked this possibility. I will try again after shutting down the other processes.
You can monitor this with ls -lh to see how fast the new file grows. Normally this should be several dozen megabytes per second. Large files take time; there is not much you can do about it.
Thanks very much. Your comment is very helpful to me.
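The growth check described above can be scripted instead of re-running ls -lh by hand. This is only a minimal sketch: the function name growth_rate and the default 5-second interval are assumptions made for illustration, not something from the thread.

```shell
# growth_rate FILE [INTERVAL]
# Print the average number of bytes per second by which FILE grew
# during INTERVAL seconds (hypothetical helper; INTERVAL defaults to 5
# and must be a positive integer).
growth_rate() {
    file=$1
    interval=${2:-5}
    before=$(wc -c < "$file")   # size in bytes at the start
    sleep "$interval"
    after=$(wc -c < "$file")    # size in bytes at the end
    echo $(( (after - before) / interval ))
}

# Example, run in a second terminal while cat is still writing:
#   growth_rate merged.fq 10
```

A result far below the several dozen megabytes per second mentioned above would be consistent with an I/O bottleneck, although it does not say which process is causing it.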
No, this is already as low-level as it gets. If the files are big, you have to be patient. Maybe you are hitting an I/O bottleneck. Is this on an HDD?
Thanks for your advice.
What are you merging? Are read1.fq and read2.fq the forward and reverse reads of a sequencing run? Why are you merging them? What downstream analyses do you want to perform?
metaphlan2 and humann2
I would reserve the term "merging" for the case where R1 and R2 are merged based on their overlap, as this is the currently adopted practice.
Keep in mind:
you are throwing out information when you concatenate the files (although it is true metaphlan2 and humann2 do not use this information).
these concatenated files have very restricted uses, as they break the pairing between R1 and R2, and most programs expect this pairing.
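To make the point about pairing concrete, here is a small sketch that counts FASTQ records. It assumes the common layout of exactly four lines per record (no wrapped sequence lines), and count_reads is a hypothetical helper name. After concatenation, the record count should be the sum of the two inputs, with every R1 record placed ahead of every R2 record, so a read and its mate are no longer adjacent or otherwise linked by the file layout.

```shell
# count_reads FILE
# Print the number of FASTQ records in FILE, assuming exactly four
# lines per record (hypothetical helper for illustration).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Example with the file names from the question:
#   count_reads read1.fq     # n records
#   count_reads read2.fq     # should also be n for an intact pair
#   count_reads merged.fq    # should be 2n after cat finishes
```

If the merged count is smaller than the sum, the concatenation was probably interrupted or the disk filled up.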
edit: I would use just the R1 file for analyses like metaphlan2 and humann2.
edit 2: it seems the humann2 recommendation is the opposite of what I suggested above:
Indeed, humann2 recommends using the merged file. I'm not clear about the difference between using one read file or both; does it affect the result? Have you tested it?
It is clear; it is immediately above the manual snippet I pasted above:
I haven't tested it, though. As bacterial genomes are very dense in coding material, I would expect the difference between using just R1 and using R1+R2 to be rather small. Maybe the HUMAnN2 authors tested this in their manuscript?