Canu genome assembly output
3 months ago
Umer ▴ 130

Hello,

I have performed some genome assemblies using the Canu assembler and I am not able to fully understand its output files. I did find a 7-8 year old post (I can't find it again), but I think things might have changed since then, so I am looking for your current experience.

Organism: fungus (Fusarium)
Data type: Nanopore (ONT)

After a successful genome assembly I got the following files in the output directory:

  • NP01.contigs.fasta
  • NP01.contigs.layout.readToTig
  • NP01.contigs.layout.tigInfo
  • NP01.correctedReads.fasta.gz
  • NP01.trimmedReads.fasta.gz
  • NP01.unassembled.fasta

Looking around, I know that NP01.contigs.fasta is the main assembly file. I am not sure about NP01.unassembled.fasta.

Question 1: What is this unassembled file about? Should I consider using this file, for example:

  • polish the unassembled contigs using Racon + raw ONT reads
  • then merge them into the main assembly file?

Or should I just ignore this file and use NP01.contigs.fasta for downstream processing and polishing?

Question 2: What are the corrected reads? I mean, what is the general process that generates corrected reads, or how does Canu correct the reads?

Question 3: What is the use case of the readToTig and tigInfo files?

Other Information Required

I also assembled the same sample with the Flye assembler. For both Canu and Flye I got assembly stats via QUAST, and the results are very different, as shown below.

QUAST stats for the Canu and Flye assemblies

Question 4: Why is there such a big difference in stats between the Canu and Flye assemblies? Specifically the number of contigs, the genome size and the N50 value.

Does this mean that Flye did a better job at assembling the genome?

Question 5: Can I use this Flye assembly for scaffolding the Canu assembly using tools like RagTag?

Question 6: For polishing the assembly, should I use the raw ONT reads or the corrected reads?

Thank you

flye assembly fusarium canu fungus • 916 views
3 months ago
shelkmike ★ 1.4k

I haven't used Canu for many years (because NextDenovo and Flye usually make better assemblies), so I cannot answer the questions regarding Canu.
To understand why the assembly sizes differ, I recommend aligning the reads to the assemblies and making coverage histograms (for example, with Qualimap). If in the Canu assembly you see an additional peak with coverage two times lower than the main peak, this means that there are haplotypic duplications (simply put, this is when alleles are mistakenly assembled separately as paralogs). If you don't see that peak, this would mean that the Canu assembly is probably correct and Flye overcollapsed some repeats.
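
For anyone who prefers to check for the half-coverage peak numerically rather than by eye, here is a minimal Python sketch. It assumes per-base depth in the three-column, tab-separated text format that `samtools depth` writes for a single BAM (contig, position, depth). The function names, the `min_depth` cutoff and the 20% tolerance are illustrative choices of mine, not part of Qualimap or any other tool.

```python
from collections import Counter


def coverage_histogram(depth_lines):
    """Count how many positions have each depth.

    Each line is expected to be 'contig<TAB>position<TAB>depth'.
    """
    hist = Counter()
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        _contig, _pos, depth = line.split("\t")
        hist[int(depth)] += 1
    return hist


def main_peak(hist, min_depth=5):
    """Return the depth shared by the most positions, ignoring depths below min_depth."""
    candidates = {d: n for d, n in hist.items() if d >= min_depth}
    return max(candidates, key=candidates.get)


def half_peak_fraction(hist, peak, tolerance=0.2):
    """Fraction of all positions whose depth is within +/- tolerance of half the main peak."""
    half = peak / 2
    lo, hi = half * (1 - tolerance), half * (1 + tolerance)
    total = sum(hist.values())
    near_half = sum(n for d, n in hist.items() if lo <= d <= hi)
    return near_half / total
```

A clearly non-trivial fraction around half the main peak would be consistent with haplotypic duplication, but what counts as "non-trivial" depends on the data, so treat the number as a hint to look closer rather than a verdict.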

Also, to understand which assembly is better it's worth running BUSCO.

Some additional ideas:
1) If your reads are from an R10.4.1 Nanopore flowcell, I recommend correcting errors in the reads using "dorado correct" (https://github.com/nanoporetech/dorado) and then assembling the genome using Hifiasm.
2) If your reads are from an older flowcell type, I advise trying NextDenovo. In my experience, it usually assembles eukaryotic genomes better than Flye and Canu.


Hi. My BUSCO results show above 99% completeness in each assembly.

I mapped the original raw ONT reads to the assembly (Canu assembly after 1 round of Racon polishing using the ONT reads).

From the generated BAM file I ran Qualimap and got the following coverage histogram.

Do the small peaks around the main peak indicate haplotypic duplications?

Canu assembly coverage histogram

I did the same for the Flye-assembled genome (Canu corrected and trimmed reads -> Flye assembly -> LongStitch scaffolding): I mapped the raw reads to the FASTA and generated the histogram in Qualimap.

Flye assembly coverage histogram

Quast stats look like this.

QUAST stats

Can you indicate which is better, and why?


It's hard to conclude anything from your coverage histograms because the bin size is too large. You can set "--cov-hist-lim 250" to make a better diagram.
Completeness is not the only important BUSCO metric. Please provide the full string of BUSCO results, like "C:99.5%[S:95.5%,D:4.0%],F:0.2%,M:0.3%,n:425,E:3.3%".
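
To compare such strings between assemblies, it can help to turn them into numbers. Below is a minimal Python sketch that parses a one-line BUSCO summary of the form quoted above into a dictionary. The regular expression and the helper names `parse_busco_summary` and `duplicated_count` are my own and are not part of BUSCO itself.

```python
import re

# One letter, a colon, a number, and an optional percent sign, e.g. "D:4.0%" or "n:425".
BUSCO_FIELD = re.compile(r"([A-Za-z]):(\d+(?:\.\d+)?)%?")


def parse_busco_summary(summary):
    """Turn a BUSCO one-line summary into a dict of floats keyed by field letter."""
    return {key: float(value) for key, value in BUSCO_FIELD.findall(summary)}


def duplicated_count(fields):
    """Approximate number of duplicated BUSCOs, from the D percentage and n."""
    return round(fields["D"] * fields["n"] / 100)
```

The D (duplicated) field is the one most relevant to the haplotypic duplication question: a noticeably higher D value in one assembly suggests that more genes were assembled twice there.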


Hi, here are all the results.

  1. CANU assembly QUAST and BUSCO stats (same as above)

    Assembly                    NP01_canu
    # contigs (>= 0 bp)         317
    # contigs (>= 1000 bp)      317
    # contigs (>= 5000 bp)      316
    # contigs (>= 10000 bp)     309
    # contigs (>= 25000 bp)     217
    # contigs (>= 50000 bp)     138
    Total length (>= 0 bp)      63600520
    Total length (>= 1000 bp)   63600520
    Total length (>= 5000 bp)   63598461
    Total length (>= 10000 bp)  63542228
    Total length (>= 25000 bp)  61704735
    Total length (>= 50000 bp)  58995096
    # contigs                   317
    Largest contig              4237331
    Total length                63600520
    GC (%)                      47.49
    N50                         1512956
    N90                         70450
    auN                         1586498
    L50                         14
    L90                         109
    # N's per 100 kbp           0

    BUSCO NP01_canu stats: C:99.9%[S:98.3%,D:1.6%],F:0.0%,M:0.1%,n:4494

    The histogram with --cov-hist-lim 250 looks like this:
    Canu assembly and raw ONT reads alignment BAM file histogram generated with Qualimap
  2. FLYE assembly QUAST and BUSCO stats (same as above)

    Assembly                    NP01_flye
    # contigs (>= 0 bp)         101
    # contigs (>= 1000 bp)      98
    # contigs (>= 5000 bp)      90
    # contigs (>= 10000 bp)     74
    # contigs (>= 25000 bp)     60
    # contigs (>= 50000 bp)     48
    Total length (>= 0 bp)      56101640
    Total length (>= 1000 bp)   56099327
    Total length (>= 5000 bp)   56077222
    Total length (>= 10000 bp)  55967757
    Total length (>= 25000 bp)  55724622
    Total length (>= 50000 bp)  55273737
    # contigs                   101
    Largest contig              6501571
    Total length                56101640
    GC (%)                      47.83
    N50                         4232660
    N90                         394633
    auN                         3824795.5
    L50                         6
    L90                         20
    # N's per 100 kbp           0.18

    BUSCO NP01_flye stats: C:99.9%[S:99.0%,D:0.9%],F:0.0%,M:0.1%,n:4494

    The histogram with --cov-hist-lim 250 looks like this:
    Flye assembly and raw ONT reads alignment BAM file histogram generated with Qualimap

Can you indicate now which assembly is better? My goal is to get as close to a chromosome-level assembly as possible.
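
For readers less familiar with the contiguity metrics in the QUAST output above: N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size, and L50 is how many contigs that takes. A minimal Python sketch of that definition follows; the function name `n50_l50` is my own, and the input is simply a list of contig lengths in bp.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running total to half the assembly defines N50.
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")
```

A higher N50 together with a lower L50 means the same amount of sequence sits in fewer, longer contigs, which is what matters when aiming for a chromosome-level assembly.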


Could you please attach diagrams named "Coverage Histogram (0-250X)"?


Hi, I have updated the images in the results shared above.


There are a lot of regions with low coverage in the Canu assembly. I suspect that this could be contamination. Maybe the Flye algorithm filtered out sequences with low coverage, while the Canu algorithm didn't. I think it's worth picking several Canu contigs that have low coverage and finding out what species they belong to. Maybe they belong to bacteria or something else.
Overall, the Flye assembly is probably better.
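
One way to shortlist such contigs is sketched below in Python. It assumes per-base depth in the three-column, tab-separated format written by `samtools depth` for a single BAM (contig, position, depth), ideally produced with `-a` so that zero-depth positions are included in the means. The function names and the 25% threshold are illustrative assumptions of mine, not a standard.

```python
import statistics
from collections import defaultdict


def mean_depth_per_contig(depth_lines):
    """Mean depth of each contig from 'contig<TAB>position<TAB>depth' lines."""
    totals = defaultdict(int)
    positions = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, _pos, depth = line.split("\t")
        totals[contig] += int(depth)
        positions[contig] += 1
    return {contig: totals[contig] / positions[contig] for contig in totals}


def low_coverage_contigs(means, fraction=0.25):
    """Contigs whose mean depth is below `fraction` of the median contig mean depth."""
    median = statistics.median(means.values())
    return sorted(c for c, m in means.items() if m < fraction * median)
```

The contigs returned could then be searched against a nucleotide database (for example with BLAST) to see whether they come from another organism.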


Thank you for your insightful feedback.
