I am attempting to assemble a ~30 Mb genome from Illumina paired-end data. To get decent depth of coverage, the sequencing was done twice, so we have four .fastq files: F & R for run one (contributing roughly 80% of the coverage) and F & R for run two (~20%). I'm unsure how to assemble from these data. Should I concatenate all forward reads and all reverse reads from the two runs, then merge the pooled pairs with PEAR (or similar) and assemble, or should I merge the F & R reads for each run separately and then assemble?
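To make the first option concrete, this is roughly what I have in mind (file names are made up; simple concatenation keeps the pairs in sync as long as both files from each run are in the same read order):

    # Option 1: pool the two runs, then merge the pooled pairs with PEAR
    cat run1_R1.fastq run2_R1.fastq > all_R1.fastq
    cat run1_R2.fastq run2_R2.fastq > all_R2.fastq
    pear -f all_R1.fastq -r all_R2.fastq -o all_merged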
I disagree with this advice; merging can substantially improve assembly, though it depends on the specific assembler, merging tool, and insert size distribution.
True. Most of the time you wouldn't merge, though. For example, you may lose the repeats. Optimally, your fragment size should be longer than the combined length of your two reads; in that case, merging would not work at all.
Sometimes having fewer reads will also improve the assembly, but you wouldn't advise people to generate fewer reads.
Actually... when fewer reads improve the assembly, I would advise people to reduce the number of reads, either by normalization or subsampling (sketch below); at that point it's too late to generate fewer reads. And the optimal fragment size is not necessarily longer than the combined read length, specifically because merging can be beneficial. Because the error rate increases toward the end of a read, which is the part most likely to overlap, merging can substantially reduce the overall error rate.
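For what it's worth, subsampling is a one-liner with reformat.sh from BBTools; something like this (a sketch only, the 50% rate is arbitrary, and the flags should be checked against your version):

    # Randomly keep ~50% of the pairs; a fixed seed makes the subsample reproducible
    reformat.sh in1=reads_R1.fastq in2=reads_R2.fastq \
        out1=sub_R1.fastq out2=sub_R2.fastq samplerate=0.5 sampleseed=7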
I'm currently doing an analysis of optimal preprocessing of metagenomes prior to assembly. So far, on multiple datasets, merging is universally beneficial for SPAdes (an assembler that makes very good use of paired reads). It's also universally beneficial for Tadpole (which does not make use of paired reads). It appears to be neutral or detrimental to MEGAHIT. However, even then, running BBMerge with the "ecco" flag (which error-corrects read pairs in the overlap region via consensus, but outputs the reads as a pair rather than merging them) is universally beneficial to MEGAHIT in my tests so far (example command below). Purely overlap-based error correction is only possible with overlapping paired reads.
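For reference, the ecco run is roughly this (a sketch; I'm going from memory on the output flags, so confirm them against the BBMerge documentation):

    # Error-correct each pair by overlap consensus, but keep the reads paired;
    # "mix" keeps everything in one output stream (nothing is actually merged here)
    bbmerge.sh in1=reads_R1.fastq in2=reads_R2.fastq \
        out1=ecco_R1.fastq out2=ecco_R2.fastq ecco mix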
From prior data, it seems that merging is detrimental to SOAPdenovo but beneficial for Ray. I have not tested the "ecco" mode with SOAPdenovo, though.
So, yes, I would recommend that people design their libraries to overlap for assembly projects to take advantage of this, which is why JGI designs its libraries to overlap. I'm not really sure what you mean by "For example, you may lose the repeats." I have not seen this occur, nor can I think of a reason why merging would cause it. And lastly, it actually is possible to merge non-overlapping reads; BBMerge can do this, for example, with the "rem" flag (sketched below). This greatly improves Tadpole assemblies because it allows the use of much longer kmers, and Tadpole relies on long kmers for a good assembly because it does not perform any graph simplification. I've not tested it thoroughly with other assemblers, but typically I find the effects of preprocessing to be highly correlated between Tadpole and SPAdes.
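A sketch of that mode, using bbmerge-auto.sh so the kmer-extension step has plenty of memory (the k and extend2 values are just examples; check the flags against your version):

    # Merge pairs, extending reads with kmers so even non-overlapping pairs can be joined
    bbmerge-auto.sh in1=reads_R1.fastq in2=reads_R2.fastq \
        out=merged.fastq outu1=unmerged_R1.fastq outu2=unmerged_R2.fastq \
        rem k=62 extend2=50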
Thanks for clarifying!
I don't mean to imply you are wrong, but if merging is beneficial, why do assemblers not explicitly recommend it in their documentation, or just implement it themselves? For example, SPAdes has a few preprocessing steps; why not add read merging to that workflow?
ALLPATHS-LG did merging internally (and explicitly required libraries with a substantial fraction of overlapping reads), and I think there is at least one other assembler, whose name I forget, that does it as well. Merging is also almost universally performed (or should be) prior to overlap-based assembly. I'm not sure why assemblers don't explicitly recommend it, but perhaps the developers tested with a merging tool that had a high false-positive rate and concluded that merging led to inferior assemblies due to the creation of chimeras. Unlike most other preprocessing steps, merging can introduce new errors, which is why BBMerge is very conservative.
Also, some tools like SPAdes are designed around certain assumptions, such as a bell-shaped distribution of paired insert sizes. Merging violates that assumption, but even so, it still improves the assembly! Similarly, the SPAdes team does not recommend or internally perform normalization, partly because it interferes with their path-simplification heuristics. Even so, in some cases normalization improves single-cell assembly (more often it's neutral or marginally worse), and in all cases the normalized assembly uses far fewer resources (time and memory), which often means the difference between an assembly and no assembly. I don't recommend normalization as a universal preprocessing step; the point is that it is often useful for SPAdes, yet explicitly not recommended (sketch below). Why? Well, extensively testing all of your assumptions (particularly those you designed an algorithm around!) is very time-consuming; I'm sure the team is busy improving other aspects of SPAdes.
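As a concrete illustration of the kind of normalization I mean, with BBNorm (a sketch; the 100x target is arbitrary and worth tuning, and the flags should be checked against your version):

    # Downsample reads from high-depth regions toward ~100x; low-depth regions are left alone
    bbnorm.sh in1=reads_R1.fastq in2=reads_R2.fastq \
        out1=norm_R1.fastq out2=norm_R2.fastq target=100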
From the SPAdes changelog for 3.12.0 (May 2018):
And now in the manual:
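If I'm reading that correctly, recent SPAdes versions accept merged reads directly alongside the leftover pairs; presumably the invocation is something like this (the --merged option name is from my reading of the manual, so double-check it against your version):

    # Feed SPAdes the merged reads plus the pairs that did not merge
    spades.py --merged merged.fastq \
        -1 unmerged_R1.fastq -2 unmerged_R2.fastq \
        -o spades_out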