I have sequenced a fungal genome using nanopore technology. I have tried different assemblers and parameters (NECAT, Flye, etc.) and all of the final assemblies are larger than expected for this species. The expected genome size is 32-34 Mbp and my assemblies are around 39 Mbp. So far:
BUSCO results show a 97% score within the fungal order, which is pretty good, and only 24 duplicates.
Raw read mapping to the assembly showed a 99.88% mapping score; around 385,000 bp fell below the low-coverage threshold, while 140,000 bp were above the high-coverage threshold.
Reference mapping against another strain showed 100% mapped reads BUT 5411 supplementary alignments, 311 non-primary alignments and a 6.8% error rate. CIGAR alignments without clipping totalled 34 Mbp.
Could someone please help me understand whether these data indicate a problem in my assembly? How could I find out if these extra few million bp are real (which seems unlikely), and how could I correct any mistakes?
Why do you feel that way? Have you aligned your assembly to an available reference and checked where these "extra" sequences are located? Are they in contiguous chunks? If so, they may be really present in your genome.
Because it is a very common species and there are 15 available genomes around the size I mentioned. The problem is that they all have many contigs (over 300), so I don't know how I can manually see where the differences are. Checking 300 contigs by hand in IGV seems a little complicated. Do you know of any other ways to do that?
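One way to avoid inspecting every contig by hand is to align the assembly to one of the available genomes and list the stretches of each contig that have no alignment. The sketch below assumes the alignment is in PAF format, for example from `minimap2 -x asm20 reference.fa assembly.fa > aln.paf`. The file names are placeholders, and the `asm20` preset is only a guess based on the ~6.8% divergence reported above. The function name `unaligned_regions` and the 10 kb `min_len` cutoff are hypothetical choices for illustration. Contigs with no alignment at all do not appear in a PAF file, so they have to be found separately by comparing against the contig names in the FASTA.

```python
from collections import defaultdict


def unaligned_regions(paf_lines, min_len=10000):
    """Return {contig: [(start, end), ...]} for query stretches with no alignment.

    Assumes standard PAF columns: query name, query length, query start,
    query end are the first four tab-separated fields.
    """
    aligned = defaultdict(list)
    lengths = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        name = fields[0]
        lengths[name] = int(fields[1])
        aligned[name].append((int(fields[2]), int(fields[3])))

    gaps = {}
    for name, intervals in aligned.items():
        intervals.sort()
        found = []
        cursor = 0  # end of the region covered so far
        for start, end in intervals:
            if start - cursor >= min_len:
                found.append((cursor, start))
            cursor = max(cursor, end)
        # Trailing unaligned stretch at the end of the contig
        if lengths[name] - cursor >= min_len:
            found.append((cursor, lengths[name]))
        if found:
            gaps[name] = found
    return gaps


# Usage (file name is a placeholder):
# with open("aln.paf") as handle:
#     for contig, regions in unaligned_regions(handle).items():
#         print(contig, regions)
```

If most of the extra sequence shows up as a few long unaligned blocks, it is more likely to be real strain-specific content; many short scattered gaps point more towards divergence or assembly artefacts, though neither pattern is conclusive on its own.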
That suggests those assemblies may not be of "complete/reference" quality. How many contigs did you end up with? Is that number more in line with the expected number of chromosomes?
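To answer the contig-count question, and to compare a new assembly with the published ones on the same terms, a small script can report the number of contigs, total length, and N50 from a FASTA file. This is only a minimal sketch: the function name `assembly_stats` and the file name in the usage comment are hypothetical.

```python
def assembly_stats(fasta_lines):
    """Return contig count, total length, largest contig and N50 for a FASTA."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)

    lengths.sort(reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running sum of sorted
    # contig lengths first reaches half of the total assembly length.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "contigs": len(lengths),
        "total_bp": total,
        "largest": lengths[0] if lengths else 0,
        "n50": n50,
    }


# Usage (file name is a placeholder):
# with open("assembly.fasta") as handle:
#     print(assembly_stats(handle))
```

Running the same function on the new assembly and on a few of the existing genomes makes it easier to see whether the size difference goes together with a much smaller number of longer contigs.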