How to compare the quality of assemblies
7 months ago

We've just completed the de novo assembly of an insect species. We used PacBio HiFi reads at approximately 50X coverage, estimated from the genome size of a closely related species. A k-mer-based genome size estimate from Illumina reads gave about 250 Mb.

The sample was pooled from a population of about 50 individuals, so it is highly heterozygous.

1) First, we assembled with Hifiasm, followed by two rounds of purging with purge_dups. The main assembly statistics are: a) contigs: 362 b) total length: 357 Mb c) N50: 8.3 Mb d) L50: 12 e) N90: 548 kb f) BUSCO metazoa: C:95.6%[S:94.8%,D:0.8%],F:0.6%,M:3.8%.

2) Then we used NextDenovo without purging. The statistics we got are: a) contigs: 426 b) total length: 224 Mb c) N50: 19.1 Mb d) L50: 5 e) N90: 125 kb f) BUSCO metazoa: C:97.5%[S:96.8%,D:0.7%],F:0.7%,M:1.8%.
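
For anyone who wants to recompute the contiguity numbers above on both assemblies with the same code, here is a minimal Python sketch. The function names `nx_stats` and `fasta_lengths` are made up for illustration, and the reader assumes an uncompressed FASTA file; it is not the tool that produced the statistics quoted in this post.

```python
def nx_stats(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    Contigs are sorted longest first; Nx is the length of the contig at
    which the running total first reaches `fraction` of the assembly
    length, and Lx is how many contigs were needed to get there.
    fraction=0.5 gives N50/L50, fraction=0.9 gives N90/L90.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("no contig lengths given")


def fasta_lengths(path):
    """Read contig lengths from a plain-text (not gzipped) FASTA file."""
    lengths = []
    current = 0
    seen_header = False
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new record starts: store the previous contig, if any.
                if seen_header:
                    lengths.append(current)
                current = 0
                seen_header = True
            else:
                current += len(line.strip())
    if seen_header:
        lengths.append(current)
    return lengths
```

Calling `nx_stats(fasta_lengths("assembly.fa"), 0.5)` and then again with `0.9` (the file name is a placeholder) gives N50/L50 and N90/L90 computed identically for both assemblies.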

So, excluding the numerous small contigs, the NextDenovo assembly looks better. Nonetheless, the metrics of the Hifiasm assembly are not bad either, and I'm confused by the roughly 130 Mb difference in total assembly size.

While I lean towards choosing the NextDenovo assembly for subsequent analyses such as annotation and FISH mapping, I can't dismiss the significant difference in size between two relatively good assemblies.

nextdenovo assembly hifiasm pacbio • 1.1k views

Are there any short-read datasets available in SRA? Can you check how many reads are left over (unaligned) after you align them to each of your assemblies? What kind of coverage do you get?
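
One way to count the leftover reads, sketched in Python under the assumption that the alignments are available as SAM text (for example, piped from whichever short-read aligner you use). The function name `unmapped_fraction` is hypothetical. The flag bits are the standard SAM ones: 0x4 for unmapped, 0x100 for secondary, 0x800 for supplementary.

```python
def unmapped_fraction(sam_lines):
    """Count unmapped reads in an iterable of SAM text lines.

    Secondary and supplementary records are skipped so that each read
    (or each mate of a pair) is counted once.
    Returns (unmapped, total).
    """
    total = 0
    unmapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary alignment
        total += 1
        if flag & 0x4:
            unmapped += 1
    return unmapped, total
```

Running this once per assembly, e.g. with `sys.stdin` as `sam_lines`, gives two unmapped counts that can be compared directly, since the read set is the same in both runs.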

7 months ago
shelkmike ★ 1.4k

Since BUSCO results are almost identical but the assembly sizes differ significantly, I suppose that the difference between the assemblies is due to incorrect assembly of repetitive regions. Either NextDenovo made too few copies or Hifiasm made too many. You can do, for example, the following:
1) Calculate the average sequencing depth in BUSCO genes.
2) Calculate the average sequencing depth of the entire genome.
The assembly in which value 2) is closer to value 1) has the more correctly assembled repeats.
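
The two steps above can be sketched in Python roughly as follows. This assumes you already have per-base depth as three tab-separated columns (contig, 1-based position, depth), which is the layout `samtools depth` prints, and a list of BUSCO gene coordinates as `(contig, start, end)` tuples, 1-based and inclusive. Extracting those coordinates from the BUSCO output is left out, and the function name `depth_summary` is made up for illustration.

```python
from collections import defaultdict


def depth_summary(depth_lines, busco_intervals):
    """Compare mean depth inside BUSCO genes with mean depth genome-wide.

    Returns (busco_mean, genome_mean, genome_mean / busco_mean).
    The interval lookup is a plain linear scan: fine for a sketch,
    slow on a real genome without an index.
    """
    by_contig = defaultdict(list)
    for contig, start, end in busco_intervals:
        by_contig[contig].append((start, end))

    genome_sum = genome_n = 0
    busco_sum = busco_n = 0
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, pos, depth = line.split("\t")
        pos, depth = int(pos), int(depth)
        genome_sum += depth
        genome_n += 1
        if any(s <= pos <= e for s, e in by_contig.get(contig, ())):
            busco_sum += depth
            busco_n += 1

    if genome_n == 0 or busco_n == 0 or busco_sum == 0:
        raise ValueError("no usable depth in the genome or in BUSCO genes")
    busco_mean = busco_sum / busco_n
    genome_mean = genome_sum / genome_n
    return busco_mean, genome_mean, genome_mean / busco_mean
```

As a rough guide, a ratio well above 1 would be consistent with collapsed repeats (too few copies in the assembly), and a ratio well below 1 with extra copies. Positions with zero depth only enter the genome-wide mean if the depth file reports them, so produce the depth file the same way for both assemblies.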

7 months ago

Are you sure you have looked into the Hifiasm output files (FASTA and GFA) in detail and excluded the non-primary contigs (hap1, hap2 and others)? I know you ran purge_dups.

Be sure to check the different contig types described at the link below and remove any non-primary contigs you may not need. Is your goal a haploid or a diploid assembly?

https://hifiasm.readthedocs.io/en/latest/interpreting-output.html#interpreting-output


A haploid assembly is our goal. For purging we used prefix.p_ctg.gfa (assembly graph of primary contigs).
