Genome Assembly Verification, Chimeric contigs
8 weeks ago
Umer ▴ 160

Hello Reader,

I have de novo assembled the genomes of several fungal samples using Flye 2.9.5-b1801.

For the purpose of this question, sample NP06 was assembled with the following command:

flye --nano-raw NP06_merged_pass.fastq --out-dir ./flye_assembly/NP06 --threads 48 --genome-size 60m --iterations 3 --scaffold

The BUSCO v5.7.1 statistics are as follows:

**BUSCO**
C:99.4%[S:98.9%,D:0.5%],F:0.1%,M:0.5%,n:4494
4469  Complete BUSCOs (C)
4446  Complete and single-copy BUSCOs (S)
23    Complete and duplicated BUSCOs (D)
6     Fragmented BUSCOs (F)
19    Missing BUSCOs (M)
4494  Total BUSCO groups searched

The QUAST v5.2.0 statistics are as follows:

Assembly   # contigs  Largest_contig  Total length  GC (%)  N50      N90     auN        L50  L90  # N's per 100 kbp
NP06_flye  112        6680635         49857017      47.54   4243289  333467  3938452.1  5    18   0
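For anyone reading the QUAST columns above, here is a minimal sketch of how N50, L50 and auN are commonly derived from a list of contig lengths. The function name and the toy lengths are made up for illustration; they are not the NP06 contig lengths, and this is not QUAST's own code.

```python
def assembly_stats(lengths):
    """Return total length, N50, L50 and auN for a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = l50 = 0
    # Walk from the longest contig down until half the assembly is covered.
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, rank
            break
    # auN is the length-weighted mean contig length (area under the Nx curve).
    aun = sum(length * length for length in ordered) / total
    return {"total": total, "N50": n50, "L50": l50, "auN": aun}


# Toy lengths, purely illustrative.
toy_lengths = [600, 400, 300, 200, 100]
toy_stats = assembly_stats(toy_lengths)
print(toy_stats)
```

Contiguity metrics like these say nothing about whether the joins inside a contig are correct, which is why a highly contiguous assembly can still contain chimeric contigs.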

I wanted to check whether the assembled genome contains any chimeric contigs, so I compared it to the closest reference genome, which has 15 chromosomes in total: 11 core and 4 accessory (chr 3, 6, 14 and 15).

Synteny between the genomes was identified using Satsuma2 and visualized with Circos.

The results look like this.

[Image: Circos plot of synteny between the NP06 assembly and the Fol4287 reference]

In this image, NP06 on the right is my assembly and Fol4287 on the left is the reference.

The marked contigs are the points of concern. Out of 12 Flye assemblies, 3 show something similar.

Questions:

  1. What could cause this merging of contigs?
  2. How can this be resolved?
  3. Does the presence of these merged contigs mean that my genome assembly is wrong?
  4. Is there any other way or tool to verify genome assemblies?

I also tried a similar approach using Nucmer from the Mummer2 package for the genome-to-genome alignment, followed by visualization with Circos using min_identity=98% and min_length=2000 as filter parameters, and got similar results, again with merged contigs.

Any insight would be very helpful.
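As a complement to eyeballing the Circos plot, the same filtered alignments can be summarised per contig. The sketch below assumes the alignment blocks have already been parsed into hypothetical `(contig, ref_chrom, aligned_length, percent_identity)` tuples; the function name, the `min_fraction` cut-off and the toy data are illustrative and not part of Nucmer or Satsuma2.

```python
from collections import defaultdict


def flag_multi_chromosome_contigs(blocks, min_identity=98.0, min_length=2000,
                                  min_fraction=0.2):
    """Return contigs whose filtered alignments split across >1 reference chromosome.

    blocks: iterable of (contig, ref_chrom, aligned_length, percent_identity).
    min_fraction: smallest share of a contig's aligned bases that a reference
    chromosome must hold to count (an arbitrary illustrative threshold).
    """
    per_contig = defaultdict(lambda: defaultdict(int))
    for contig, ref_chrom, length, identity in blocks:
        # Same style of filter as described in the post.
        if identity >= min_identity and length >= min_length:
            per_contig[contig][ref_chrom] += length

    flagged = {}
    for contig, by_ref in per_contig.items():
        total = sum(by_ref.values())
        major = {ref: bases / total for ref, bases in by_ref.items()
                 if bases / total >= min_fraction}
        if len(major) > 1:
            flagged[contig] = major
    return flagged


# Toy alignment blocks, purely illustrative.
toy_blocks = [
    ("contig_1", "chr1", 50000, 99.1),
    ("contig_1", "chr5", 40000, 98.7),
    ("contig_2", "chr2", 90000, 99.5),
    ("contig_2", "chr7", 3000, 98.2),   # small hit, below min_fraction
    ("contig_2", "chr9", 60000, 91.0),  # dropped by the identity filter
]
candidates = flag_multi_chromosome_contigs(toy_blocks)
print(candidates)
```

A flagged contig is only a candidate: it may be a misjoin, but it may equally be a genuine rearrangement relative to the reference, so it still needs read-level evidence.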

Regards

flye genome assembly fungus • 737 views
Just to add to the information above: I tried another assembler, NextDenovo v2.5.2, and then polished the assembly with the same Nanopore + Illumina short-read data using NextPolish v1.4.1 with the best option.

I then performed a similar synteny analysis using Satsuma2 and Circos, and the results show a similar trend of chimeric contigs. Image attached below.

[Image: synteny plot for the NextDenovo assembly]

> results show similar trend of chimeric contigs.

Do you have a reason to believe that your strain is going to look identical to the published relative? If two separate assemblers show similar results, perhaps that is the truth. You could do some long-range PCR etc. to verify whether it is accurate.

Actually, the trend is not similar across the other samples. For example:

[Images: plots for two other samples]

This makes me worry that the assemblies themselves are incorrect, or that there is some other biological or technical reason for these patterns.

I also looked for runs of Ns in the FASTA file, but there were none.
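For reference, a check for N runs of the kind mentioned above can be sketched in a few lines of plain Python. The function name and the toy FASTA are made up for illustration; coordinates are 0-based, half-open.

```python
import re


def find_n_runs(fasta_text, min_run=1):
    """Return (contig, start, end) for every run of N/n at least min_run long."""
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else "unnamed"), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    runs = []
    for contig, seq in records:
        for match in pattern.finditer(seq):
            runs.append((contig, match.start(), match.end()))
    return runs


# Toy FASTA, purely illustrative.
toy_fasta = """>contig_1 example
ACGTNNNNNACGT
ACGT
>contig_2
ACGTACGT
"""
n_runs = find_n_runs(toy_fasta)
print(n_runs)
```

An empty result agrees with the 0 N's per 100 kbp that QUAST reports for NP06, and it means the merged contigs were not joined through N-filled scaffold gaps.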

Recently I experienced something similar when assembling some strains of a filamentous fungal pathogen. Flye and nextDenovo assembled many chimeric contigs. My interpretation was that they were misassemblies, because the two assemblers did not produce the same chimeric contigs. I couldn't figure out why Flye and nextDenovo were merging contigs from different chromosomes, so I ended up using the assemblies from Canu, which were essentially free of chimeric contigs.

Hi.

I tried Canu as my first assembler, but Canu gives a lot of contigs, around 300 to 400.

As far as I understand, Canu generates a new contig for every haplotig (if it finds SNPs in the reads). How do I deal with that? Did you use Purge Haplotigs?

Yes, I observed that Canu, unlike Flye and nextDenovo, reported 10 to 50 small contigs shorter than 50 kb or so. I just kept them in the assembly.

In my case Canu didn't try to assemble haplotigs, because the genome is haploid. If your sample is diploid, or is cross-contaminated, then yes I expect different haplotigs would start to appear in the assembly.

I have Canu assemblies too, as my fungus is also filamentous and haploid. After assembly I mapped the raw reads back to the assembly and generated a coverage histogram using Qualimap; it looks like this:

[Image: Qualimap coverage histogram]

Because of the small peak at the left of the graph, I assume that there are repeats or haplotigs in the assembly.

What do you suggest?

You could check which contigs are giving you the left peak; I'd expect it's coming from short contigs that are basically repetitive DNA. If qualimap does not give you this info, you could try mosdepth to do that. You could also try to remove the short contigs if they are contained within larger contigs using dedupe.sh with >=99% identity or so.
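A rough sketch of that first check, assuming a mosdepth run has produced its per-contig summary table (tab-separated columns `chrom`, `length`, `bases`, `mean`, `min`, `max`, plus a `total` row). The function name, the `ratio` threshold and the toy table are illustrative only.

```python
def low_depth_contigs(summary_text, ratio=0.5):
    """List (contig, length, mean_depth) where depth is below ratio x assembly-wide mean.

    summary_text: contents of a mosdepth-style summary table (assumed layout:
    chrom, length, bases, mean, min, max; tab-separated).
    """
    rows = []
    for line in summary_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue
        name = fields[0]
        # Skip the header, the overall total and any per-region summary rows.
        if name in ("chrom", "total") or name.endswith("_region"):
            continue
        rows.append((name, int(fields[1]), float(fields[3])))
    if not rows:
        return []
    total_length = sum(length for _, length, _ in rows)
    # Length-weighted mean depth over the whole assembly.
    overall = sum(length * mean for _, length, mean in rows) / total_length
    return [row for row in rows if row[2] < ratio * overall]


# Toy summary table, purely illustrative.
toy_summary = "\n".join([
    "chrom\tlength\tbases\tmean\tmin\tmax",
    "contig_1\t5000000\t250000000\t50.00\t0\t120",
    "contig_2\t4000000\t192000000\t48.00\t0\t110",
    "contig_57\t30000\t300000\t10.00\t0\t40",
    "total\t9030000\t442300000\t48.98\t0\t120",
])
suspects = low_depth_contigs(toy_summary)
print(suspects)
```

If the contigs listed are all short, that supports the idea that the left peak comes from small, redundant or repetitive contigs rather than from the large chromosome-scale ones.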

I am not very experienced with this, but you could map the reads used to create the assembly back to it and look for unusual patterns that might suggest a misassembly. These can resemble the signatures of a structural variant, such as reads pairing across contigs.
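One way to make this concrete with long reads is to count how many reads align cleanly across a suspected junction. The sketch below assumes the reads were mapped back to the assembly and the alignments are available in PAF format (the default output of minimap2); the function name, the `flank` and `min_mapq` defaults and the toy records are hypothetical.

```python
def spanning_read_count(paf_lines, contig, position, flank=1000, min_mapq=20):
    """Count distinct reads whose alignment covers position +/- flank on contig.

    PAF columns used (0-based): 0 read name, 5 target name, 7 target start,
    8 target end, 11 mapping quality.
    """
    names = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or fields[5] != contig:
            continue
        t_start, t_end, mapq = int(fields[7]), int(fields[8]), int(fields[11])
        if (mapq >= min_mapq
                and t_start <= position - flank
                and t_end >= position + flank):
            names.add(fields[0])
    return len(names)


# Toy PAF records, purely illustrative.
toy_paf = [
    "read_a\t20000\t0\t20000\t+\tcontig_7\t100000\t40000\t60000\t19000\t20000\t60",
    "read_b\t15000\t0\t9000\t+\tcontig_7\t100000\t41000\t50000\t8800\t9000\t60",
    "read_b\t15000\t9000\t15000\t+\tcontig_3\t80000\t100\t6100\t5900\t6000\t60",
    "read_c\t12000\t0\t12000\t-\tcontig_7\t100000\t49500\t61500\t11500\t12000\t60",
    "read_d\t40000\t0\t40000\t+\tcontig_7\t100000\t30000\t70000\t30000\t40000\t0",
]
at_junction = spanning_read_count(toy_paf, "contig_7", 50000)
away_from_junction = spanning_read_count(toy_paf, "contig_7", 45000)
print(at_junction, away_from_junction)
```

A junction crossed by many confidently mapped reads is more likely to be real, whereas a position where spanning reads drop to near zero and alignments are clipped or split to another contig (like `read_b` in the toy data) deserves a closer look in a genome browser.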

Hi, thank you for the suggestion. Should I map the long reads back to the assembly, or the corresponding short reads?

Also, how do I check for these patterns? In the Qualimap report generated from the BAM file, or is there another way?
