Hi, community! I am de novo assembling Nanopore long reads and I am comparing my draft genome assembly against the publicly available reference genome. First, some details about the inputs. The reference genome was produced with a hybrid method that combined Illumina and PacBio: the short reads were assembled and the long reads (~30x coverage) were used for gap filling. My assembly resolved complete chromosomes, which is not the case in the reference genome. The reference genome annotation also contains more protein-coding genes than the annotation of my own assembly. Both assemblies are from the same species but different strains, and their genome sizes differ.
When I ran nucmer from MUMmer with the option -maxmatch, followed by delta-filter, I got an average identity of 93%. How is this possible? I find it difficult to understand because:
1) My assembly had ~47x coverage, whereas according to the literature ~70x coverage is needed to overcome the systematic errors of Nanopore reads. So my assembly likely still contains errors, even though I performed many error-correction, consensus, and polishing steps.
2) With the Nanopore assembly, I could span more repetitive regions than the reference genome does.
So, in general, there are several reasons why I would have expected a somewhat lower identity percentage.
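For context, this is roughly how I summarised the 93% figure. This is only a sketch, assuming the average is a length-weighted mean over the alignments reported by `show-coords -T` (the column layout is my assumption, and the rows below are made-up example data, not my real output):

```python
# Sketch: length-weighted average identity from `show-coords -T` output.
# Assumed tab-separated columns: S1  E1  S2  E2  LEN1  LEN2  %IDY  [tags...]
# The example rows below are invented for illustration.

def weighted_identity(coords_text):
    """Average %identity over alignments, weighted by reference alignment length."""
    total_len = 0
    weighted = 0.0
    for line in coords_text.splitlines():
        fields = line.split("\t")
        # Skip header/non-alignment lines: alignment rows start with an integer.
        if len(fields) < 7 or not fields[0].strip().isdigit():
            continue
        aln_len = int(fields[4])   # LEN1: alignment length on the reference
        idy = float(fields[6])     # %IDY of this alignment
        total_len += aln_len
        weighted += idy * aln_len
    return weighted / total_len if total_len else 0.0

# Two made-up alignments, lengths 1000 and 3000:
example = ("1\t1000\t1\t1000\t1000\t1000\t90.0\tchr1\tctg1\n"
           "2000\t5000\t1\t3000\t3000\t3000\t94.0\tchr1\tctg1\n")
print(weighted_identity(example))  # prints 93.0
```

Note that a length-weighted mean is dominated by the long, well-aligned blocks, so it can look high even when short divergent regions align poorly.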
Let me know if I made myself clear. Thank you in advance for your help.
It's important to realize here how fast the field changes. Nanopore errors have become much less problematic than some older literature would make you think. I don't know which paper you are referring to, but keep in mind that the technology was of lesser quality when those authors generated their data.
Hi! Thank you for your answer. Actually, I am struggling with that coverage value. What counts as low or high coverage when it comes to overcoming systematic Nanopore errors? Could you suggest a paper that clarifies this? The one I read is from 2015.
Thank you!