Hello Reader,
I have de novo assembled the genomes of some fungal samples using Flye 2.9.5-b1801.
For the purpose of this question, the sample is NP06,
assembled with the following command:
flye --nano-raw NP06_merged_pass.fastq --out-dir ./flye_assembly/NP06 --threads 48 --genome-size 60m --iterations 3 --scaffold
Following are the stats with BUSCO_v5.7.1
**BUSCO**
C:99.4%[S:98.9%,D:0.5%],F:0.1%,M:0.5%,n:4494
4469 Complete BUSCOs (C)
4446 Complete and single-copy BUSCOs (S)
23 Complete and duplicated BUSCOs (D)
6 Fragmented BUSCOs (F)
19 Missing BUSCOs (M)
4494 Total BUSCO groups searched
Following are the stats from Quast_v5.2.0:

| Assembly | # contigs | Largest_contig | Total length | GC (%) | N50 | N90 | auN | L50 | L90 | # N's per 100 kbp |
|---|---|---|---|---|---|---|---|---|---|---|
| NP06_flye | 112 | 6680635 | 49857017 | 47.54 | 4243289 | 333467 | 3938452.1 | 5 | 18 | 0 |
I wanted to check whether the assembled genome contains any chimeric contigs, so I compared it to the closest reference genome, which has 15 chromosomes in total: 11 core and 4 accessory (chr 3, 6, 14, 15).
Synteny between the genomes was identified using Satsuma2 and visualized via Circos.
The results look like this.
In this image, NP06 (right side) is my assembly and Fol4287 (left side) is the reference.
The marked contigs are the points of concern. Out of 12 Flye assemblies, 3 show something similar.
Questions:
- What could be the reason for this merging of contigs?
- How can this be resolved?
- Does the presence of these merged contigs mean that my genome assembly is wrong?
- Is there any other way or tool to verify genome assemblies?
I also tried a similar approach using Nucmer from the Mummer2 package for the genome-to-genome alignment, followed by visualization via Circos with min_identity=98% and min_length=2000 as filter parameters, and got similar results, i.e. merged contigs.
Your insight will be helpful to me.
Regards
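For anyone wanting to screen all 12 assemblies without inspecting every Circos plot, the filtering described above can be sketched in a few lines of Python. This is only a rough illustration: it assumes the alignments have already been exported as (contig, reference chromosome, alignment length, percent identity) records, and the `min_aligned_bp` cutoff, the function name, and the example records are all hypothetical, not something Nucmer or Satsuma2 produces in this form.

```python
from collections import defaultdict

def flag_multi_chromosome_contigs(alignments, min_identity=98.0,
                                  min_length=2000, min_aligned_bp=50_000):
    """Return contigs whose filtered alignments cover more than one reference chromosome.

    alignments: iterable of (contig, ref_chrom, aln_length, identity_percent).
    min_aligned_bp is a hypothetical cutoff so that a few short stray hits
    (e.g. repeats) do not flag a contig on their own.
    """
    aligned = defaultdict(lambda: defaultdict(int))
    for contig, chrom, length, identity in alignments:
        # Same filters as in the post: identity >= 98% and length >= 2000 bp.
        if identity >= min_identity and length >= min_length:
            aligned[contig][chrom] += length

    flagged = {}
    for contig, per_chrom in aligned.items():
        major = {c: bp for c, bp in per_chrom.items() if bp >= min_aligned_bp}
        if len(major) > 1:
            flagged[contig] = major
    return flagged

# Made-up records, for illustration only.
example = [
    ("contig_1", "chr1", 120_000, 99.1),
    ("contig_1", "chr1", 80_000, 98.7),
    ("contig_1", "chr1", 1_500, 99.9),   # dropped: shorter than min_length
    ("contig_2", "chr2", 200_000, 99.0),
    ("contig_2", "chr5", 150_000, 98.5),
    ("contig_2", "chr9", 90_000, 95.0),  # dropped: identity below min_identity
    ("contig_3", "chr4", 300_000, 99.3),
    ("contig_3", "chr7", 3_000, 99.0),   # kept, but below min_aligned_bp
]
flagged = flag_multi_chromosome_contigs(example)
print(flagged)
```

A flagged contig is only a candidate: a real translocation in the strain would look the same as a misjoin in this table, so it still needs read-level evidence.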
Just to add to the above information, I have tried another assembler, NextDenovo_v2.5.2, and then polished the assembly with the same Nanopore + Illumina SR data using NextPolish_v1.4.1 with the best option. Later I performed a similar synteny analysis using Satsuma2 and Circos, and the results show a similar trend of chimeric contigs. Image attached below.
Do you have a reason to believe that your strain is going to look identical to the published relative? If two separate assemblers are showing similar results, perhaps that is the truth. You could do some long-range PCR etc. to verify whether that is accurate.
Actually, the trend is not similar across other samples. For example:
So this makes me worry whether the assemblies themselves are incorrect, or whether there is some other biological or technical reason for this kind of pattern.
I have also looked for any strings of Ns in the FASTA file, but there were none.
Recently I experienced something similar when assembling some strains of a filamentous fungal pathogen. Flye and nextDenovo assembled many chimeric contigs. My interpretation was that they were misassemblies, because the two assemblers did not produce the same chimeric contigs. I couldn't figure out why Flye and nextDenovo were merging contigs from different chromosomes, so I ended up using the assemblies from Canu, which were essentially free of chimeric contigs.
Hi.
I tried Canu as my first assembler, but Canu gives a lot of contigs, around 300 to 400.
As far as I understand, Canu generates a new contig for every haplotig (if it finds SNPs in the reads). How do I deal with that? Did you use Purge Haplotigs?
Yes, I observed that Canu, unlike Flye and nextDenovo, reported 10 to 50 small contigs shorter than 50 kb or so. I just kept them in the assembly.
In my case Canu didn't try to assemble haplotigs, because the genome is haploid. If your sample is diploid, or is cross-contaminated, then yes I expect different haplotigs would start to appear in the assembly.
I have Canu assemblies too, as my fungus is also filamentous and haploid. After assembly, I mapped the raw reads back to the assembly and generated a coverage histogram using qualimap, and it looks like this. Because of the small peak at the left of the graph, I assume that there are repeats or haplotigs in the assembly.
What do you suggest?
You could check which contigs are giving you the left peak; I'd expect it's coming from short contigs that are basically repetitive DNA. If qualimap does not give you this info, you could try mosdepth to do that. You could also try to remove the short contigs if they are contained within larger contigs using dedupe.sh with >=99% identity or so.
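A rough sketch of how the per-contig check could look in Python, assuming you have a tab-separated per-contig depth table with `chrom`, `length` and `mean` columns. As far as I know that is the layout of the mosdepth summary file, but please check it against your version. The `fraction` threshold, the function names and the numbers in the example table are made up for illustration.

```python
import csv
import io

def length_weighted_median_depth(contigs):
    """Depth at which half of the assembled bases sit at or below it."""
    total = sum(length for _, length, _ in contigs)
    running = 0
    for _, length, depth in sorted(contigs, key=lambda c: c[2]):
        running += length
        if running >= total / 2:
            return depth
    raise ValueError("no contigs given")

def low_coverage_contigs(summary_text, fraction=0.5):
    """List (contig, length, mean_depth) for contigs well below the typical depth.

    fraction is a hypothetical threshold: contigs with mean depth below
    fraction * (length-weighted median depth) are reported.
    """
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    contigs = [
        (row["chrom"], int(row["length"]), float(row["mean"]))
        for row in reader
        # Skip aggregate rows; only per-contig rows are of interest here.
        if row["chrom"] != "total" and not row["chrom"].endswith("_region")
    ]
    reference = length_weighted_median_depth(contigs)
    return [c for c in contigs if c[2] < fraction * reference]

# Made-up table in the assumed layout, for illustration only.
SUMMARY = "\n".join([
    "chrom\tlength\tbases\tmean\tmin\tmax",
    "contig_1\t6000000\t480000000\t80.00\t0\t310",
    "contig_2\t4000000\t314000000\t78.50\t0\t295",
    "contig_3\t40000\t1208000\t30.20\t0\t90",
    "contig_4\t25000\t4000000\t160.00\t12\t400",
    "total\t10065000\t799208000\t79.40\t0\t400",
])
low = low_coverage_contigs(SUMMARY)
print(low)
```

If the contigs behind the left peak turn out to be short, they are the natural candidates for the containment check with dedupe.sh mentioned above.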
I am not very experienced with this, but you could map the reads used to create the assembly back to it and then look for any odd patterns (similar to signatures of a structural variant, such as reads pairing across contigs) that might suggest a misassembly.
Hi. Thank you for the suggestion. Should I map the long reads back to the assembly, or the corresponding short reads?
Also, how do I check for these patterns? In the qualimap report generated from the BAM file, or is there another way?
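One possible way to check a specific suspected join, sketched in Python and not meant as a definitive recipe: count how many long reads align straight across it. Long reads are the ones that can span a junction in a single alignment, so a join supported by few or no spanning reads is suspicious. The sketch assumes the long reads were mapped with an aligner that writes PAF (minimap2 does), and that you already have an approximate junction coordinate from where the synteny switches chromosomes. The contig name, coordinate, `flank` and `min_mapq` values below are hypothetical.

```python
def count_spanning_reads(paf_lines, contig, junction, flank=1000, min_mapq=20):
    """Count distinct reads with one alignment covering junction +/- flank on contig.

    Uses the standard PAF columns: 1 = read name, 6 = target name,
    8 = target start, 9 = target end, 12 = mapping quality.
    """
    spanning = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname, tname = fields[0], fields[5]
        tstart, tend, mapq = int(fields[7]), int(fields[8]), int(fields[11])
        if (tname == contig and mapq >= min_mapq
                and tstart <= junction - flank and tend >= junction + flank):
            spanning.add(qname)
    return len(spanning)

def paf(qname, qlen, tname, tlen, tstart, tend, mapq):
    """Build a minimal made-up PAF record for the example below."""
    aln = tend - tstart
    return "\t".join(map(str, [qname, qlen, 0, qlen, "+", tname, tlen,
                               tstart, tend, aln, aln, mapq]))

# Made-up alignments around a hypothetical junction at contig_2:100000.
records = [
    paf("read_a", 15000, "contig_2", 350000, 95000, 110000, 60),   # spans the junction
    paf("read_b", 19500, "contig_2", 350000, 80000, 99500, 60),    # stops before it
    paf("read_c", 19800, "contig_2", 350000, 100200, 120000, 60),  # starts after it
    paf("read_d", 25000, "contig_2", 350000, 90000, 115000, 5),    # low mapping quality
    paf("read_e", 15000, "contig_1", 900000, 95000, 110000, 60),   # different contig
]
spanning = count_spanning_reads(records, "contig_2", 100000)
print(spanning)
```

Comparing this count with the typical depth elsewhere on the same contig gives a feel for whether the join is supported. Looking at the same region in a genome browser should show the same thing visually, with many reads clipped at one position.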