Pilon polishing highly repetitive nanopore assembly
6.1 years ago
treitlis ▴ 40

Hi all,

I would like to ask for some suggestions on pilon polishing of a canu-assembled genome.

Long story short, we are trying to assemble a chloroplast genome that is extremely repetitive. We have around 3 Gbp of nanopore data, but we are unable to assemble the genome into a single contig. Before we try to circularize the assembly manually, we wanted to polish it first. I used nanopolish and then pilon, using Illumina reads (2x150 bp).

However, here is the big issue. Pilon confirms only 76% of the bases of the biggest contig in the assembly. As far as I have read, bwa reports only the best mapping location for each read, so a read that maps to multiple locations is reported at just one of them. Since the genome seems to contain a lot of repeats, it looks like bwa places the repetitive reads at a single location, and the other copies of those repeats in the genome are left unpolished.
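One thing that may help when checking this: as far as I know, `bwa mem` does not discard the other locations entirely. When a read has only a few alternative hits, they are listed in the `XA` tag of the primary record as `contig,±pos,CIGAR,NM;` entries, and the `-a` option writes all found alignments as secondary records for single-end or unpaired reads. Below is a minimal sketch in Python for listing the alternative hits of one SAM line. The function name `parse_xa_tag` and the example record are made up for illustration, and the sketch says nothing about whether pilon makes use of such alignments.

```python
def parse_xa_tag(sam_line):
    """Return the alternative hits listed in the XA tag of one SAM record.

    Each hit is a tuple (contig, strand, position, cigar, edit_distance).
    Records without an XA tag yield an empty list.
    """
    fields = sam_line.rstrip("\n").split("\t")
    hits = []
    # Optional tags start at the 12th column of a SAM record.
    for tag in fields[11:]:
        if not tag.startswith("XA:Z:"):
            continue
        for entry in tag[5:].split(";"):
            if not entry:
                continue  # the tag value ends with a trailing ';'
            contig, signed_pos, cigar, nm = entry.rsplit(",", 3)
            strand = "-" if signed_pos.startswith("-") else "+"
            hits.append((contig, strand, abs(int(signed_pos)), cigar, int(nm)))
    return hits


# Hypothetical record: a read placed on contig1 with MAPQ 0 and two other hits.
record = (
    "read1\t0\tcontig1\t100\t0\t150M\t*\t0\t0\t*\t*\tNM:i:0\t"
    "XA:Z:contig1,+52000,150M,0;contig2,-9800,150M,1;"
)
alternative_hits = parse_xa_tag(record)
```

Counting how many reads over an unconfirmed region carry an `XA` tag (or have MAPQ 0) would at least show whether multi-mapping is really the cause.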

How can I polish the entire contig and make bwa (or some other software) report all mapping locations? I tried bbmap set to report all mapping locations, and this increased the coverage to 99.8% (based on pileup.sh). However, bbmap is not a recommended mapper for pilon, and bbmap's minid is set to 0.76 (the default), so I am worried that this type of mapping could also create a lot of issues for pilon.
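To compare mappers on equal terms, the covered fraction can also be computed independently of pileup.sh. The following is a rough sketch, assuming a three-column, tab-separated per-base depth table (contig, position, depth) that lists every position including those with zero depth, such as the output of `samtools depth -a`. The function name `coverage_breadth` and the `min_depth` parameter are my own choices, not part of any tool.

```python
from collections import defaultdict


def coverage_breadth(depth_lines, min_depth=1):
    """Return {contig: fraction of listed positions with depth >= min_depth}.

    Assumes every position of each contig appears in the input; positions
    that are missing from the table are not counted at all.
    """
    covered = defaultdict(int)
    total = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue  # skip blank lines
        contig, _pos, depth = line.split("\t")[:3]
        total[contig] += 1
        if int(depth) >= min_depth:
            covered[contig] += 1
    return {contig: covered[contig] / total[contig] for contig in total}


# Toy input standing in for a real depth file.
example = [
    "c1\t1\t5\n",
    "c1\t2\t0\n",
    "c1\t3\t2\n",
    "c1\t4\t7\n",
    "c2\t1\t0\n",
    "c2\t2\t1\n",
]
breadth = coverage_breadth(example)
```

Raising `min_depth` gives a stricter view, since a position covered by one or two reads is unlikely to be polished reliably.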

I have noticed this issue in other data from my genomes, mostly in eukaryotic data where the rRNA sequences occur at multiple locations. Some of the copies are not polished completely and have mismatches to the rRNA sequences that were amplified manually and sequenced by Sanger sequencing. If there were polymorphism among the rRNA copies in the genome, I would see it in the Sanger sequencing, but there is none in our data.

Any suggestions how to deal with this?

Thank you

nanopore canu Assembly genome pilon • 2.9k views

Thank you for your well-written and detailed question. I have slightly adapted your title to make it more specific about what you are asking.

I am not familiar with chloroplast assembly. Could you elaborate on the size of the contig?


Thank you for your quick reply.

The chloroplast genome is probably around 500 kbp (I am still not sure, because the dataset also contains nuclear and bacterial data). The canu assembly has two main contigs, one of 240 kbp and one of 160 kbp, plus some smaller ones. The fraction of confirmed bases is 93% for the 240 kbp contig and 85% for the 160 kbp contig.

I made a mistake in the first post: the 76% of confirmed bases comes from a 280 kbp contig assembled by miniasm. Miniasm does not use corrected reads, but that assembly had been polished with nanopolish beforehand. This contig is actually a fusion of the two canu contigs, in the sense that the canu contigs overhang it (which will probably help me figure out the assembly).

The higher fraction of confirmed bases with canu suggests that miniasm might introduce some misassemblies and small insertions. Still, a decent number of regions remain unconfirmed. I actually noticed this issue when I put the selected contigs together to polish them and some of them had really low coverage, so I decided to polish the two assemblies separately. Judging by the canu contigs, miniasm seems to introduce small insertions that make a mess of the genome, but the canu contigs also still contain plenty of unpolished bases.

In fact, the canu contig with 93% confirmed bases has regions to which other miniasm contigs map, and those miniasm contigs have 99% coverage. This suggests that some repetitive element breaks the contigs in canu, whereas in miniasm the repeat lies within the contig, which could explain the discrepancy in coverage.
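To put the canu percentages in terms of bases, here is a small sketch. It assumes the contig lengths are exactly 240 kbp and 160 kbp, which are rounded figures, and the helper name `confirmed_summary` is only for illustration.

```python
def confirmed_summary(contigs):
    """Summarise confirmed bases over several contigs.

    `contigs` is a list of (length_bp, confirmed_percent) pairs.
    Returns (unconfirmed_bp, length-weighted confirmed fraction).
    """
    total_bp = sum(length for length, _ in contigs)
    confirmed_bp = sum(length * pct / 100 for length, pct in contigs)
    return total_bp - confirmed_bp, confirmed_bp / total_bp


# Rounded values for the two main canu contigs.
canu_contigs = [(240_000, 93), (160_000, 85)]
unconfirmed_bp, confirmed_fraction = confirmed_summary(canu_contigs)
```

With these rounded inputs, roughly 40.8 kbp of the 400 kbp in the two contigs remain unconfirmed, which corresponds to about 89.8% confirmed overall.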


Hi treitlis, I have run into exactly the same problem. Have you found a solution?
