[PacBio assembly] Remaining indels after polishing
3
0
Entering edit mode
7.5 years ago
Rox ★ 1.4k

Hello everyone !

I am currently de novo assembling a drosophila genome, with a high level of polymorphism, using long PacBio reads. As my sequencing coverage is enough (220x), I chose to use the "PacBio" only method, which consist in "polishing" the final assembly by aligning raw reads against it and correct base by base. The algorithm performing this step is called quiver.

After polishing my assembly, I decided to align RNAseq reads against it and to check what was going on. I realised that in some particular region, some indels remains and refuse to disappear. I have tried to increase the polishing coverage to deal with these indels, but it didn’t worked. I don't have enough coverage with Illumina reads, so I can't use Pilon to correct theses regions using short reads.

Also, manually by looking at it in IGV, I couldn’t find any others example that the particular region I am talking about.

Does anyone ever experienced that problem ? Any idea about how to calculate indel rate without a reference genome ? How can I find some others regions with high indel rate ?

I hope that I was clear, don't hesitate to ask me any question to clarify myself,

Cheers,

Roxane

PacBio Assembly polishing quiver • 4.6k views
ADD COMMENT
1
Entering edit mode

We found a similar situation once when we looked into genes (~3kbp) which were present in multiple copies. In that case there were >100 copies of that gene and Quiver did not manage to determine for some of them the correct sequence. This is likely the case if two regions are too similar and mapping becomes ambiguous. That said, we made 5 iterations of polishing (it was a smaller genome) and many of them got polished as a correction helps in the next round to determine better mapping of previously ambiguous reads. But for a few SNPs it flipped back & forth not being capable to find the correct sense. We finally did manual curation and PCR - probably not what you want to do...

ADD REPLY
0
Entering edit mode

Can you comment on the size/numbers of these indels?

I assume your genome sequence did not come from a single fly and was a mix of a bunch.

ADD REPLY
0
Entering edit mode

Indels are like a single nucleotide inserted, or missing, it is not assembly wide, just in a particular regions that I was able to find this. The numbers, I can't say exactly, because I am not sure that there is not other case else where in my assembly. In the concerned region, it was like maximum 10 (of a single insertion or deletions, rarely 2 nucleotide).

Indeed, it come from a bunch of several flies, that is why polymorphism can be a issue.

In this case, I suspect more the sequencing technology, with its higher error rate for indel, to cause this issue. The problem is that it is supposed to be solved using quiver. But in my particular region it is not...

Is it better the way I explained ?

ADD REPLY
0
Entering edit mode

If your RNAseq reads (which I assume are illumina) are consistently showing the same sequence then would it be better to trust them (especially if the differences with the reference are only a single bp)?

ADD REPLY
0
Entering edit mode

Yeah of course, I know I should trust them. I want to correct indels regarding RNAseq mapping indications. If RNAseq mapping indicate that there is an indel, then I want to correct it. But I want a way to estimate how many indels like this I have in my assembly, and how to correct them without doing it manually.

ADD REPLY
1
Entering edit mode
7.5 years ago

I've experienced some problems with seeing indels in a PacBio assembly after correcting with Pilon, when comparing the genome sequence to Illumina RNAseq data, in that case it looked like the high G/C content in that region lead to indel-errors/sequencing artefacts in the Illumina data. Could that be an issue here as well?

ADD COMMENT
0
Entering edit mode

In fact, I didnt used Pilon to correct the data, I used quiver with pacBio reads, so our approaches are different. The pacbio sequencing coverage is supposed not to be impacted by GC%. I don't really know what is going on there

ADD REPLY
1
Entering edit mode
7.3 years ago

I encountered the same problem than you. If you have Illumina reads, I advise you to correct the indels with this tool : https://github.com/douglasgscofield/PacBio-utilities

That solved this problem for me. Best, A.V.

ADD COMMENT
0
Entering edit mode

Great ! I will have a look on that, thanks for your advices ! Do you have also an idea to evaluate the indel rate before and after this processing ? So I can measure the efficience of the method ?

ADD REPLY
0
Entering edit mode

Hi Amandine !

Si I have recently tried the tools you advised me, but it appears it didn't correct anything in my assembly... I was in possession of enough Illumina reads, so I don't really understand why it didn't solved my problem... The remaining indel rate is still 1 indel out of 1 kB, unevenly distributed on the genome.

Do you have any more advices ?

ADD REPLY
0
Entering edit mode

I actually contacted PacBio themselves about that problem, here is their answer :

The most likely explanation for these indel errors is that they are occurring at the site of a hetrozygous SNP. Quiver was built as a haploid consensus algorithm that understands that our sequencer will most often generate deletion errors, then insertion errors, and then very rarely SNP errors. When it comes across a site of a heterozygous SNP in your data, sometimes it will conclude that the conflicting nucleotides are actually best explained by a large number of insertion errors from the sequencer rather than as a high number of substitution errors, and then it deletes the base to "fix" this mistake.

But while calculating the indel rate using Alfred (https://github.com/tobiasrausch/alfred), the insertion rate was 0.00068 and the deletion rate was 0.00026 (I guess it is calculated by base), which is still really high.

I start wondering if this is a assembly/polishing issue or if their is some strange stuff within theses regions...

ADD REPLY
0
Entering edit mode
6.3 years ago
Rox ★ 1.4k

Hi again Biostar.

I completely forgot to give any feed back on that thread, so here are some more informations. I was able to get more clues about thoses remaining indels after polishing. I noticed theses indels were always located in contigs associated with 2 chromosomes where the polymorphism level were still very high.

On the figure I uploaded, we can see in pink a representation of the polymorphism (it's a "rate of the major base" : 1 = no polymorphism) against the indel rate in blue (0 = no indels). On the left, theses are stats on contigs before polishing, and to the right, after polishing. Top pictures are a contig associated to a chromosome which polymorphism level is very low. And bottom is a contig with a lot of polymorphism.

From all the informations we can extract from thoses pictures, the one I find interesting is the fact that we have thoses very clustered regions of remaining high indel rate that correspond to regions with high polymorphism. I don't know how you guys feel about this, but it look like a kind of threeshold effect : like at some point, some inner parameters of quiver make the polishing fail/make it worse. That's my feeling when I see that "clustered" effect.

If you need any enlightments about the method I used to produce theses graphs, don't hesitate to ask me !

Hope this can be of anyhelp, also hope to have some feedback from people that may have experienced the same. I still feel like I didn't got a satisfying explanation from that phenomenom !

Cheers,

Rox

![PacBio Indels remaining after polishing localized in highly polymorphic regions][2]

indels Pac Bio

PS : I don't know how to properly include an image in here ;(

ADD COMMENT
1
Entering edit mode

For future reference: How to add images to a Biostars post

ADD REPLY
0
Entering edit mode

Thanks genomax, I was missing a step indeed !

ADD REPLY

Login before adding your answer.

Traffic: 1641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6