I am using the REAPR software to assess the correctness of my de novo genome assembly from Illumina HiSeq 4000 reads. REAPR compares the theoretical expectation of uniform coverage to the coverage actually observed when the component reads are mapped back to the inferred assembly. The coverage-vs-GC plot from this mapping is shown at this link: https://ibb.co/k3Efvb
Does this mean GC bias in coverage exists but is very minimal, since the slope (solid black line) is almost zero? I find it surprising that the average fragment coverage here is < 5X, whereas the crude per-nucleotide coverage from the Lander/Waterman equation, C = LN/G, is ~90X!
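The Lander/Waterman arithmetic above is easy to sanity-check with a few lines of code. A minimal sketch follows; the read length, read count, and genome size are illustrative placeholders chosen to reproduce ~90X, not values from the actual dataset.

```python
# Sanity check of per-nucleotide coverage via the Lander/Waterman relation
# C = L * N / G. The numbers below are illustrative placeholders; substitute
# your own read length, read count, and estimated genome size.
def lander_waterman_coverage(read_length, num_reads, genome_size):
    """Expected per-base coverage C = L * N / G."""
    return read_length * num_reads / genome_size

# Example: 150 bp reads, 24 million reads, 40 Mb genome
C = lander_waterman_coverage(150, 24_000_000, 40_000_000)
print(f"Expected coverage: {C:.1f}x")  # Expected coverage: 90.0x
```

Note that REAPR's "fragment coverage" counts properly paired fragments spanning a base, not raw read bases, so it is expected to be lower than C, though a gap this large is still worth investigating.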
If some GC bias is indeed contributing to coverage-depth differences, is it beyond what should be expected for my data (haploid, fungal spores), or for any data of this type? I ask because REAPR reports ~90K errors where the observed fragment coverage is ZERO, yet the assembly is NOT broken or gapped at these positions, even though it should be! These are obvious mis-joins, right?
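Zero-fragment-coverage stretches like the ones REAPR flags can also be located independently from a per-base depth profile (e.g., the third column of `samtools depth -a` output). A minimal sketch, assuming the depths for one contig have already been read into a plain list; the example values are made up:

```python
# Locate zero-coverage stretches in a per-base depth profile.
# Input: a list of integer depths for one contig; positions are 0-based.
def zero_coverage_runs(depths, min_len=1):
    """Return (start, end) half-open intervals where depth == 0."""
    runs, start = [], None
    for i, d in enumerate(depths):
        if d == 0 and start is None:
            start = i                      # run of zeros begins
        elif d != 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))    # run of zeros ends
            start = None
    if start is not None and len(depths) - start >= min_len:
        runs.append((start, len(depths)))  # run extends to contig end
    return runs

print(zero_coverage_runs([5, 3, 0, 0, 0, 7, 0, 2]))  # [(2, 5), (6, 7)]
```

Cross-checking these intervals against REAPR's error calls would confirm whether the two tools agree on where the suspect joins are.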
Is it known how and why GC-content-dependent coverage bias can produce mis-assemblies during de novo assembly? I've tried a5, a5miseq, ABySS, and SPAdes with different pre-processing variations, but to no avail: REAPR always reports about 90K errors.
Related to this potential GC-bias-introduced error: are there other software tools, such as deepTools' computeGCBias and correctGCBias, or GCcorrect, suited to my purposes? And what is the general consensus about the need for and efficacy of such GC-bias correction steps/tools: yes, no, or data-dependent?
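Before reaching for a dedicated correction tool, the bias itself can be quantified with a simple binning exercise: split the assembly into fixed-size windows, bin windows by GC content, and compare mean depth per bin (essentially what the linked plot shows). A rough sketch under stated assumptions: `windows` would come from your own assembly and mapped depths, and the tuples below are made-up examples.

```python
# Sketch: mean coverage per GC-content bin, computed from (sequence, depth)
# pairs for fixed-size windows. The example windows are fabricated.
from collections import defaultdict

def gc_fraction(seq):
    """Fraction of G/C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def coverage_by_gc(windows, bin_pct=5):
    """windows: iterable of (sequence, mean_depth); returns {GC% bin: mean depth}."""
    bins = defaultdict(list)
    for seq, depth in windows:
        gc = int(round(gc_fraction(seq) * 100)) // bin_pct * bin_pct
        bins[gc].append(depth)
    return {b: sum(d) / len(d) for b, d in sorted(bins.items())}

example = [("ATATATGC", 4.0), ("GCGCGCAT", 5.0), ("GCGCATAT", 4.5)]
print(coverage_by_gc(example))  # {25: 4.0, 50: 4.5, 75: 5.0}
```

A roughly flat profile across bins would support the "minimal bias" reading of the plot; a strong trend would argue for correction.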
And finally, if correction is recommended, at what stage of my pre-processing should it be done, with respect to steps such as adapter removal, quality trimming, length filtering, tile filtering, error correction, etc.? THANKS!
Indeed, Brian, I have tried your assemblyPipeline.sh steps, in addition to other combinations of assemblers and pre-processing steps.
Each time, REAPR reports mis-assemblies and mis-joins that should really be gaps, on the order of 80-90K, for every de novo assembly protocol I tested: a5, a5miseq, SPAdes, ABySS, and your pipeline with more involved pre-processing.
So my problem appears to be the reverse of what should be expected? Rather than holes in my assembly, several of these expected holes are incorrectly joined together, and REAPR is flagging those joins as errors? That is very strange, if I am not mistaken!
Well, there are two possibilities: either REAPR is wrong, or preprocessing is causing more misassemblies. Either is possible. In general, overly aggressive preprocessing can cause more misassemblies while simultaneously increasing contiguity, and finding the right balance can be difficult. The steps I listed tend to be beneficial for bacterial isolates and metagenomes. I have not tested them on haploid fungi, but... how do you sequence a fungal spore, anyway? Does it have just one copy of DNA that needs to be heavily amplified with MDA, or do you culture it, or what? Some preprocessing does not play well with heavy amplification.
When qualifying a pipeline, I generally recommend starting with data for which you know the correct answer: first sequence something with a finished reference and optimize the pipeline against it, then run your real sample through the optimized process and hope for the best, since it is very difficult to confidently call misassemblies without a reference.