I just assembled a plasmid using Illumina 2x250 PE. I am pretty confident that the assembly is fine.
I checked by mapping the raw reads back to the assembly. In general I have very high coverage (sometimes more than 1000x), and very few unmapped pairs.
But in some regions the coverage drops to 10-20x. This looks concerning, but I still think my assembly is fine because:
1. In these areas I also see barely any unmapped pairs
2. I did not see more mismatches between the reads and the assembly sequence than in other regions
Now my questions:
1. What are the properties of sequences where Illumina gives lower coverage? Is there a paper on this, or software to check?
2. How else can I check whether my assembly is fine in these low-coverage areas?
Thanks thackl! So low/high GC content is mostly the problem? Any other factors?
I'd say GC is the most common, particularly if you observe some local valleys in coverage.
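To check whether the valleys line up with extreme GC content, you could compare windowed GC against windowed coverage. Here is a minimal sketch in plain Python. It assumes you already have the assembly as a string and a per-base depth list of the same length (for example parsed from `samtools depth -a` output). The function names, the 100 bp window and the "below 10% of the median" cutoff are arbitrary choices for illustration, not recommended values.

```python
from statistics import median


def window_gc(seq, window=100):
    """GC fraction of consecutive, non-overlapping windows of seq."""
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / len(chunk))
    return fractions


def window_mean_depth(depth, window=100):
    """Mean depth of consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def low_coverage_windows(seq, depth, window=100, low_fraction=0.1):
    """Return (window_start, gc_fraction, mean_depth) for every window whose
    mean depth is below low_fraction * median window depth.

    Assumes depth holds exactly one value per base of seq.
    """
    gc = window_gc(seq, window)
    cov = window_mean_depth(depth, window)
    cutoff = low_fraction * median(cov)
    return [
        (i * window, g, c)
        for i, (g, c) in enumerate(zip(gc, cov))
        if c < cutoff
    ]
```

If the flagged windows all sit at very high or very low GC fractions while the rest of the plasmid is moderate, GC bias is a plausible explanation for the drops.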
Of course there are also more sample-specific issues, e.g. compounds such as sugars that stick to the DNA and create biases during DNA extraction, but that is probably not the case here.
One other possibility is structural heterogeneity, for example two versions of the plasmid, one with a larger deletion and one without. If you assemble the longer variant and then map the reads to it, you will get low coverage across the deleted region. In that case, though, you would expect sharp coverage drops with split-mapped reads at the ends of the region.
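One rough way to tell these two cases apart from the depth profile alone is to look at how abruptly coverage falls at the edges of each low-coverage region. The sketch below is only an illustration: it assumes a per-base depth list, and the `low`, `high` and `max_ramp` thresholds are hypothetical values you would need to tune to your own data. It does not inspect split or soft-clipped reads, which you would still want to look at in a genome browser.

```python
def low_coverage_runs(depth, low):
    """Return half-open (start, end) intervals where depth < low."""
    runs = []
    start = None
    for pos, d in enumerate(depth):
        if d < low and start is None:
            start = pos
        elif d >= low and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(depth)))
    return runs


def ramp_length(depth, pos, step, high):
    """Count bases walked from pos in direction step (+1 or -1)
    before depth reaches at least high (or the sequence ends)."""
    n = 0
    while 0 <= pos < len(depth) and depth[pos] < high:
        pos += step
        n += 1
    return n


def classify_drops(depth, low, high, max_ramp=5):
    """Label each low-coverage run as 'sharp' or 'gradual'.

    A run is 'sharp' when coverage is back at >= high within max_ramp
    bases on both sides, as expected at a deletion breakpoint;
    otherwise it is 'gradual', as expected for a local bias.
    """
    results = []
    for start, end in low_coverage_runs(depth, low):
        left = ramp_length(depth, start - 1, -1, high)
        right = ramp_length(depth, end, +1, high)
        label = "sharp" if max(left, right) <= max_ramp else "gradual"
        results.append((start, end, label))
    return results
```

A "sharp" label is only a hint to go and look for split-mapped reads at those coordinates; a "gradual" valley fits better with a sequence-composition bias such as GC.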