We're doing some sequencing and de novo assembly to compare the genomes of a globally distributed eukaryotic parasite in different populations. We've come up with a potentially interesting pattern: one population seems to have a larger genome (8 Mb vs 6 Mb), more duplicate genes, and more inferred frameshifts in the assembly than the others (genomes from the "normal" populations have about 10 frameshifts each, whereas these genomes have about 300-400).
This could have an interesting biological explanation, but I'm also worried that we're just doing the assembly wrong for this population. (We are using Quake to error-correct the reads, followed by SPAdes for assembly, having tried various alternatives and found this combination to work best.)
I'm trying to work out if the frameshifts are real or an assembly artifact. The relevant data seem to be:
- We have several isolates from the population, and all of them show the frameshifts at the same positions (they are very similar genomes overall).
- When I map the reads back to the de novo assembly and look at the frameshift positions, there seems to be good support for them: coverage is good (30-40x, similar to the overall coverage), and many reads span the whole insertion or deletion that causes the frameshift.
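
To make the read-support check in the second point concrete, here is a minimal sketch of how it could be tallied per site from the alignments. It assumes each mapped read has been reduced to its 1-based leftmost position and CIGAR string (the POS and CIGAR columns of a SAM record). The function name `indel_support`, the `window` tolerance, and the example reads are all hypothetical and only for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_support(reads, site, window=2):
    """Count reads spanning `site` with and without an indel near it.

    reads  : iterable of (start, cigar), start being the 1-based leftmost
             mapped reference position, as in the SAM POS column.
    site   : 1-based reference coordinate of the putative indel.
    window : tolerance in bp (an assumption), since aligners may place the
             same indel at slightly different coordinates.
    """
    with_indel = 0
    without_indel = 0
    for start, cigar in reads:
        ref = start  # reference coordinate of the next reference-consuming base
        has_indel = False
        for length, op in CIGAR_OP.findall(cigar):
            # Insertions and deletions are recorded at the current reference position.
            if op in "ID" and abs(ref - site) <= window:
                has_indel = True
            # Only these operations advance along the reference.
            if op in "MDN=X":
                ref += int(length)
        end = ref - 1  # last reference base covered by this read
        if start < site < end:  # the read spans the site
            if has_indel:
                with_indel += 1
            else:
                without_indel += 1
    return with_indel, without_indel


# Hypothetical reads around a 2 bp deletion at reference position 150.
example_reads = [
    (100, "50M2D50M"),  # spans the site and carries the deletion
    (120, "30M2D70M"),  # spans the site and carries the deletion
    (90, "100M"),       # spans the site without any indel
    (300, "100M"),      # does not reach the site
]
support, no_support = indel_support(example_reads, site=150)
print(support, no_support)
```

Running this over every inferred frameshift position would give a table of supporting vs non-supporting reads, rather than relying on inspecting each site by eye.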
For example:
And the reads mapping to that region (showing pretty even coverage across the deletion):
In one case, an insertion causes a frameshift with 22 reads supporting the insertion and 4 showing a gap instead; other than this, the frameshifts look real.
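
One rough way to put a number on the 22 vs 4 case is to ask how likely 4 or more discordant reads out of 26 would be if they were all independent errors. The sketch below uses a simple binomial tail; the per-read error rate of 0.01 is purely an assumption for illustration, not a measured value, and the function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial tail: probability of at least k events in n independent
    trials, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


supporting = 22       # reads containing the insertion
discordant = 4        # reads showing a gap instead
depth = supporting + discordant
assumed_error = 0.01  # ASSUMPTION: per-read chance of a spurious gap at this site

fraction = supporting / depth
p_tail = prob_at_least(discordant, depth, assumed_error)
print(f"supporting fraction: {fraction:.3f}")
print(f"P(>= {discordant} discordant reads by error alone): {p_tail:.2e}")
```

A very small tail probability under a plausible error rate would suggest the discordant reads are not just random error, for example reads from a second copy of a duplicated region or from a mixed sample, but that interpretation depends entirely on the assumed error rate.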
Does this seem convincing, or are there further tests I could do with the reads/assembly to investigate further? Are there possible sequencing or assembly artifacts that would explain this?
I guess the best test would be to PCR across these regions from the original DNA, but I'd like to do the most thorough bioinformatic analysis possible before asking a wet lab person to do that.
Thanks a lot!