Hello everyone !
I am currently de novo assembling a drosophila genome, with a high level of polymorphism, using long PacBio reads. As my sequencing coverage is enough (220x), I chose to use the "PacBio" only method, which consist in "polishing" the final assembly by aligning raw reads against it and correct base by base. The algorithm performing this step is called quiver.
After polishing my assembly, I decided to align RNAseq reads against it and to check what was going on. I realised that in some particular region, some indels remains and refuse to disappear. I have tried to increase the polishing coverage to deal with these indels, but it didn’t worked. I don't have enough coverage with Illumina reads, so I can't use Pilon to correct theses regions using short reads.
Also, manually by looking at it in IGV, I couldn’t find any others example that the particular region I am talking about.
Does anyone ever experienced that problem ? Any idea about how to calculate indel rate without a reference genome ? How can I find some others regions with high indel rate ?
I hope that I was clear, don't hesitate to ask me any question to clarify myself,
Cheers,
Roxane
We found a similar situation once when we looked into genes (~3kbp) which were present in multiple copies. In that case there were >100 copies of that gene and Quiver did not manage to determine for some of them the correct sequence. This is likely the case if two regions are too similar and mapping becomes ambiguous. That said, we made 5 iterations of polishing (it was a smaller genome) and many of them got polished as a correction helps in the next round to determine better mapping of previously ambiguous reads. But for a few SNPs it flipped back & forth not being capable to find the correct sense. We finally did manual curation and PCR - probably not what you want to do...
Can you comment on the size/numbers of these indels?
I assume your genome sequence did not come from a single fly and was a mix of a bunch.
Indels are like a single nucleotide inserted, or missing, it is not assembly wide, just in a particular regions that I was able to find this. The numbers, I can't say exactly, because I am not sure that there is not other case else where in my assembly. In the concerned region, it was like maximum 10 (of a single insertion or deletions, rarely 2 nucleotide).
Indeed, it come from a bunch of several flies, that is why polymorphism can be a issue.
In this case, I suspect more the sequencing technology, with its higher error rate for indel, to cause this issue. The problem is that it is supposed to be solved using quiver. But in my particular region it is not...
Is it better the way I explained ?
If your RNAseq reads (which I assume are illumina) are consistently showing the same sequence then would it be better to trust them (especially if the differences with the reference are only a single bp)?
Yeah of course, I know I should trust them. I want to correct indels regarding RNAseq mapping indications. If RNAseq mapping indicate that there is an indel, then I want to correct it. But I want a way to estimate how many indels like this I have in my assembly, and how to correct them without doing it manually.