Question

Polishing PacBio assembly : The ideal coverage for highly polymorphic species

2

7.6 years ago

Rox ★ 1.4k

Hello everyone !

I need some advice on the final step of my D. suzukii assembly, built from long PacBio reads: the polishing step. First, let me explain how I obtained the file I am working on.

I made two assemblies using different assemblers: Falcon and Canu. I assessed and compared these assemblies using QUAST for the classic assembly metrics, and BUSCO 2 to assess gene content (using the Arthropoda and Diptera sets). I also evaluated gene content using some handmade scripts that looked for particular genes of interest.

The two assemblies were really different in terms of metrics and gene content, and I wasn't fully happy with either one. I then used Mahul Chakraborty's tool, quickmerge (on GitHub: https://github.com/mahulchak/quickmerge ). This tool created a merged assembly that combines the advantages of each assembly. My BUSCO results were really nice compared to the previous ones. The assembly was also more contiguous, with a greater N50 and far fewer contigs.

As a reminder, or for those who don't know BUSCO: it is a tool that looks in your assembly for genes that are shared across species (for Arthropoda, genes that are orthologous across the Arthropoda clade, and so on) and always present in single copy. The genes are then categorized as follows: S: Single, D: Duplicated, F: Fragmented, M: Missing. There are 800 genes assessed for Arthropoda, and 2800 genes assessed for Diptera.

My data come from a very polymorphic species, and I always tend to get high Duplicated scores. That doesn't really worry me. What I absolutely want to reduce is the number of fragmented or missing genes.

Then I tried polishing with two different coverages, 40X and 80X, and assessed each result with BUSCO. The results are kind of confusing to me, and I need the advice of an expert eye. Here are my BUSCO results depending on coverage:

Non polished assembly :

Arthropoda : S : 91% , D : 5.9% , F : 2.2% , M : 0.9%

Diptera : S : 87.1%, D : 5.1%, F : 5.1%, M : 2.7%

40X polished assembly :

Arthropoda : S : 89.1% , D : 9.4%, F : 0.9%, M : 0.9%

Diptera : S : 86.9% , D : 8.4%, F : 2.8%, M : 1.9%

80X polished assembly :

Arthropoda : S : 86.5%, D : 11.4%, F : 0.9%, M : 1.2%

Diptera : S : 84.6%, D : 11%, F : 2.6%, M : 1.8%
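For anyone who wants to compare the runs side by side, here is a minimal Python sketch that tabulates the percentages quoted above and derives two summary numbers per run: complete BUSCOs (S + D) and the fragmented-plus-missing total (F + M). The dictionary layout and the function name `summarize` are only illustrative choices; they are not part of BUSCO.

```python
# BUSCO percentages copied from the post: (S, D, F, M) per lineage set and run.
RESULTS = {
    "Arthropoda": {
        "unpolished": (91.0, 5.9, 2.2, 0.9),
        "40X": (89.1, 9.4, 0.9, 0.9),
        "80X": (86.5, 11.4, 0.9, 1.2),
    },
    "Diptera": {
        "unpolished": (87.1, 5.1, 5.1, 2.7),
        "40X": (86.9, 8.4, 2.8, 1.9),
        "80X": (84.6, 11.0, 2.6, 1.8),
    },
}


def summarize(s, d, f, m):
    """Return (complete, fragmented_or_missing), rounded to one decimal."""
    return round(s + d, 1), round(f + m, 1)


if __name__ == "__main__":
    for lineage, runs in RESULTS.items():
        for run, scores in runs.items():
            complete, frag_missing = summarize(*scores)
            print(f"{lineage:<10} {run:<10} C={complete:5.1f}%  F+M={frag_missing:4.1f}%")
```

On these numbers, F + M for Diptera goes from 7.8% (unpolished) to 4.7% (40X) and 4.4% (80X), while for Arthropoda it goes from 3.1% to 1.8% and then back up to 2.1%.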

So, I am not that surprised that the more we polish, the more duplicated genes we get. My final assembly size is 280 Mb, but the estimated genome size, from flow cytometry, is 250 Mb, so I was expecting some polymorphic regions to be duplicated. What surprises me, and what I don't understand, is the variation in fragmented and missing genes. I was expecting that the more reads I used, the fewer fragmented and missing genes I would get. That works for the Diptera set, but not for Arthropoda. Doubling the coverage slightly increased this number for Arthropoda (missing goes from 0.9% to 1.2%) but not for Diptera, while the duplicated genes kept increasing sharply in both sets.
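To put the size difference in numbers, here is a small Python sketch using only the two figures given above. It assumes the flow cytometry estimate is accurate, which the post does not verify, so the result is at best a rough indication of how much extra sequence the assembly carries. The function name `excess` is just an illustrative choice.

```python
ASSEMBLY_MB = 280   # assembly size from the post
ESTIMATED_MB = 250  # flow cytometry estimate from the post (assumed accurate)


def excess(assembly_mb, estimated_mb):
    """Return (extra megabases, extra as a percentage of the estimate)."""
    extra = assembly_mb - estimated_mb
    return extra, 100 * extra / estimated_mb


extra_mb, extra_pct = excess(ASSEMBLY_MB, ESTIMATED_MB)
print(f"{extra_mb} Mb over the estimate ({extra_pct:.0f}%)")
```

That is 30 Mb, or 12% more than the estimated genome size.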

I am confused now, because I find the BUSCO results from 40X polishing better for Arthropoda, but those from 80X better for Diptera. My interpretation is that polishing kind of "revealed" our true level of duplication, which reflects a high level of polymorphism. I think we lose a few genes in the Arthropoda set because the sequences may have diverged a lot, so BUSCO can't recognize some of the genes anymore.

I know it is a bit long to read, but I really need an outside point of view. Has anyone already assembled a highly polymorphic species? Should I keep the 40X polishing or the 80X polishing? Or maybe continue polishing with an even higher coverage? Any recommendations or criticism about the pipeline I used (merging two different assemblies, for example)?

Thanks for reading this far!

Cheers,

Roxane

genome assembly pacbio polishing busco • 4.5k views

ADD COMMENT • link 7.2 years ago by Rox ★ 1.4k

0

Hi Roxane,

I'm facing something similar with the BUSCO stats on my merged assemblies (quickmerge). In my case, the number of fragmented and missing genes goes significantly higher after the merging step itself. Did you get any insight into what was going on here?

Thanks!

ADD REPLY • link 7.2 years ago by VS ▴ 740

0

Actually, yes, I did get some insight! Let me add an answer to my own post so everyone can benefit from it. Even if my case is a bit different from yours, maybe it could help you in some way. In any case, we can still discuss it and try to figure out what happened here. If you used quickmerge, then I strongly recommend asking Mahul Chakraborty, the creator of quickmerge, for advice.

He was very nice and responsive, and helped me find better-suited parameters for quickmerge. To get good values, I ran, for example, 3 rounds of merging in the following way: 1) falcon + canu 2) merged + falcon 3) merged + canu. After that, my BUSCO missing and fragmented gene scores were better.

But as I said, maybe the best way to find the approach that fits your data is to send a nice email to Mahul! :)
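The order of the three merging rounds described above can be sketched in Python as follows. This does not call quickmerge: `merge` is a hypothetical callable standing in for one real merging run, and `label_merge` is a toy stand-in that only records which inputs were combined. The post does not say which assembly played which role in each round, so the argument order here simply follows the order written above, and the actual command line and parameters should come from the quickmerge documentation.

```python
def run_rounds(falcon, canu, merge):
    """Apply the three merging rounds from the post, in order.

    `merge(a, b)` is a hypothetical callable that merges two assemblies
    and returns the merged result; it stands in for a real quickmerge run.
    """
    merged = merge(falcon, canu)    # round 1: falcon + canu
    merged = merge(merged, falcon)  # round 2: merged + falcon
    merged = merge(merged, canu)    # round 3: merged + canu
    return merged


def label_merge(a, b):
    """Toy stand-in that only records which inputs were combined."""
    return f"({a}+{b})"


print(run_rounds("falcon", "canu", label_merge))
```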

ADD REPLY • link 7.2 years ago by Rox ★ 1.4k

0

Thank you Roxanne for the reply and all the info! I'll write to Mahul.

ADD REPLY • link 7.2 years ago by VS ▴ 740

0

Hi Roxanne. Thanks for the good info! What tool(s) did you use for polishing? Pilon?

ADD REPLY • link 6.7 years ago by Eric Normandeau 11k

1

Hi Eric! To polish my assembly, I used the PacBio polishing tool, Quiver.

ADD REPLY • link 6.7 years ago by Rox ★ 1.4k

score 3 · Accepted Answer · 2017-09-27

After a few months working on my dataset, I think I finally managed to answer my own question. I'm reporting my thoughts here so they can benefit anyone in need.

My question was: which polishing coverage should I use to reduce fragmented and missing genes?

Previously, I reported results for 2 polishing coverages: 40x and 80x. I also tested 160x (I had enough coverage).

Because the Diptera set is closer to Drosophila, I chose to take only the Diptera scores into account here.

160x polished assembly

Diptera : S : 84.3%, D : 11.3%, F : 2.8%, M : 1.6%

As you can see, doubling the polishing coverage (from 80x to 160x) did not reduce the fragmented and missing genes as much as I expected. It even slightly decreased my single-copy score. I talked with someone experienced in PacBio assembly, who told me that a polishing coverage higher than 100x doesn't improve the assembly much and can even make it a bit worse.
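To make the diminishing return visible, here is a minimal Python sketch that computes the change in each Diptera BUSCO category between consecutive polishing coverages, using only the percentages reported in this thread. The names `DIPTERA` and `deltas` are illustrative.

```python
# Diptera BUSCO percentages from the thread, in order of polishing coverage.
DIPTERA = [
    ("40x", {"S": 86.9, "D": 8.4, "F": 2.8, "M": 1.9}),
    ("80x", {"S": 84.6, "D": 11.0, "F": 2.6, "M": 1.8}),
    ("160x", {"S": 84.3, "D": 11.3, "F": 2.8, "M": 1.6}),
]


def deltas(before, after):
    """Percentage-point change per BUSCO category, rounded to one decimal."""
    return {k: round(after[k] - before[k], 1) for k in before}


for (name_a, a), (name_b, b) in zip(DIPTERA, DIPTERA[1:]):
    print(f"{name_a} -> {name_b}: {deltas(a, b)}")
```

Going from 80x to 160x moves every category by at most 0.3 percentage points, against up to 2.6 points going from 40x to 80x.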

Considering that, I chose to keep the 80x polishing. Compared to 40x, it has a better indel rate, and the main goal of the polishing step is to decrease this indel rate, which is the signature error of PacBio assemblies.

I'm open to any discussion regarding these results !

Cheers

Roxane