Hello,
I have performed some genome assemblies using the canu assembler and I am not able to fully understand its output files. I did find a 7-8 year old post (I can't find it again), but I think things might have changed since then, so I am looking for your current experience.
Organism: fungus (Fusarium)
Data type: Nanopore (ONT)
After a successful genome assembly I got the following files in the output directory:
- NP01.contigs.fasta
- NP01.contigs.layout.readToTig
- NP01.contigs.layout.tigInfo
- NP01.correctedReads.fasta.gz
- NP01.trimmedReads.fasta.gz
- NP01.unassembled.fasta
Looking around, I know that NP01.contigs.fasta is the main assembly file. I am not sure about NP01.unassembled.fasta.
Question 1: What is this unassembled file about? Should I consider using this file, for example:
- polish the unassembled contigs using Racon + raw ONT reads
- then merge them into the main assembly file?
Or should I just ignore this file and use NP01.contigs.fasta for downstream processing and polishing?
Question 2: What are the corrected reads? I mean, what is the general process that generates corrected reads, or how does canu correct the reads?
Question 3: What is the use case of the readToTig and tigInfo files?
Other Information Required
I also assembled the same sample with the flye assembler, and for both canu and flye I got assembly stats via Quast. The results are very different, as shown below.
Question 4: Why is there such a big difference in stats between the Canu and Flye assemblies? Specifically the # of contigs, the genome size and the N50 value.
Does this mean that Flye did a better job at assembling the genome?
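For reference on Question 4: N50 is the contig length at which the contigs of that length or longer add up to at least half of the total assembly size, so it depends on both the contig count and the total size. Below is a minimal sketch of that calculation in Python. The function names `n50` and `fasta_lengths` are made up for illustration, and the numbers may differ slightly from Quast, which applies its own minimum contig length filter.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First contig at which the running total reaches half the assembly
        if running >= half:
            return length
    return 0


def fasta_lengths(lines):
    """Yield the length of each sequence from an iterable of FASTA lines."""
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                yield current
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        yield current
```

It could be run on both files, e.g. `n50(fasta_lengths(open("NP01.contigs.fasta")))`, to compare the two assemblies on exactly the same terms.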
Question 5: Can I use this Flye assembly for scaffolding the canu assembly using tools like RagTag?
Question 6: For polishing the assembly, should I use the raw ONT reads or the corrected reads?
Thank you
Hi. My BUSCO results show above 99% completeness in each assembly.
I mapped the original raw ONT reads to the assembly (canu assembly after 1 round of racon polishing using ONT reads).
From the generated BAM file I ran Qualimap and got the following coverage histogram.
Do the small peaks around the main peak indicate haplotypic duplications?
I did the same for the Flye-assembled genome (Canu corrected and trimmed reads -> Flye assembly -> Longstitch scaffolding): I mapped the raw reads to the FASTA and generated the histogram in Qualimap.
The Quast stats look like this.
Can you indicate which is better, and why?
It's hard to read anything from your coverage histograms because the bin size is too large. You can set "--cov-hist-lim 250" to get a better diagram.
Completeness is not the only important BUSCO metric. Please provide the full BUSCO result string, like "C:99.5%[S:95.5%,D:4.0%],F:0.2%,M:0.3%,n:425,E:3.3%".
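To compare the two assemblies field by field, a result string of that shape can be split into its parts. This is only a rough sketch: `parse_busco` is a hypothetical helper, and it assumes the one-letter keys and percent signs appear exactly as in the example string above.

```python
import re


def parse_busco(summary):
    """Parse a BUSCO one-line result string into a dict keyed by field letter."""
    fields = {}
    # Each field looks like "<letter>:<number>" with an optional trailing "%"
    for key, value in re.findall(r"([A-Za-z]):([\d.]+)%?", summary):
        # "n" is the number of BUSCO groups searched, the rest are percentages
        fields[key] = int(value) if key == "n" else float(value)
    return fields


example = "C:99.5%[S:95.5%,D:4.0%],F:0.2%,M:0.3%,n:425,E:3.3%"
busco = parse_busco(example)
```

The duplicated fraction (`busco["D"]`) is the value to look at next to the completeness when checking for duplicated sequence in an assembly.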
Hi, here are all the results.
With --cov-hist-lim 250, the histograms look like this.
Can you indicate now which assembly is better? My goal is to get as close to a chromosome-level assembly as possible.
Could you please attach diagrams named "Coverage Histogram (0-250X)"?
Hi, I updated the images in the results shared above.
There are a lot of regions with low coverage in the Canu assembly. I suppose this could be contamination. Maybe Flye's algorithm filtered out low-coverage sequences, while Canu's didn't. I think it's worth finding several low-coverage Canu contigs and checking what species they belong to. Maybe they belong to bacteria or something else.
Overall, the Flye assembly is probably better.
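One way to find the low-coverage Canu contigs mentioned above is to compute the mean depth per contig and flag those far below the typical value. The sketch below assumes three-column, tab-separated per-position depth text (contig, position, depth), which is the layout `samtools depth` writes, plus a dict of contig lengths. The function names and the 0.25 cutoff fraction are arbitrary choices for illustration, not a recommended threshold.

```python
from collections import defaultdict
from statistics import median


def mean_depth_per_contig(depth_lines, contig_lengths):
    """Return {contig: mean depth} from 'contig<TAB>pos<TAB>depth' lines.

    Dividing by the full contig length means positions missing from the
    depth output are counted as zero coverage.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue
        contig, _pos, depth = line.split("\t")[:3]
        totals[contig] += int(depth)
    return {
        name: totals[name] / length
        for name, length in contig_lengths.items()
        if length > 0
    }


def low_coverage_contigs(mean_depths, fraction=0.25):
    """Return contigs whose mean depth is below fraction * median depth."""
    if not mean_depths:
        return []
    cutoff = fraction * median(mean_depths.values())
    return sorted(name for name, depth in mean_depths.items() if depth < cutoff)
```

The flagged contigs could then be extracted from the FASTA and searched against a public database to see which species they match.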
Thank you for your insightful feedback.