I am assembling a diploid fungal genome (Illumina PE 100 bp reads, coverage in the range 200-300x). I believe the size of this genome is ~13 Mb, but the assemblies I get are always between 22 and 24 Mb. I've used Velvet and SOAPdenovo with multiple parameter sets.
Interestingly, when the scaffolds are aligned against each other within the same assembly, around 1/3 of the genome aligns with 80-90% identity. We even sequenced an additional library with a different insert size, but the results are similar.
How can I decide whether these scaffolds are duplications or heterozygous alleles?
I'm no expert in this field, but it might be worth looking at what has been done to detect CNV (Copy Number Variation) regions (the problems are somewhat similar). One method, for example, is to look at the coverage across your genome. If you have a median coverage of 200x and some regions have a coverage of 300x, that might indicate that the segment is duplicated.
I am sorry, I have no references to share; I just remember hearing this in some talks. There were some statistical methods to discriminate those regions.
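To make the coverage idea concrete, here is a minimal Python sketch, not one of the statistical methods mentioned above. It assumes you already have a mean read depth per scaffold (for example from mapping the reads back to the assembly); the function name `classify_by_coverage`, the input dictionary and the 25% tolerance are all hypothetical choices for illustration. Scaffolds well above the median may be collapsed duplications, as described above. As an additional assumption, scaffolds near half the median may be two alleles that were assembled separately, since each copy would then attract only about half of the reads.

```python
from statistics import median


def classify_by_coverage(depths, tolerance=0.25):
    """Label scaffolds by mean depth relative to the median over all scaffolds.

    depths: dict of scaffold name -> mean read depth (hypothetical input).
    tolerance: relative deviation from the median still called "normal"
               (an assumed cutoff, not a published threshold).
    """
    baseline = median(depths.values())
    labels = {}
    for name, depth in depths.items():
        ratio = depth / baseline
        if ratio > 1 + tolerance:
            labels[name] = "high"    # candidate collapsed duplication
        elif ratio < 1 - tolerance:
            labels[name] = "low"     # candidate separately assembled allele
        else:
            labels[name] = "normal"
    return labels


# Made-up depths, using the 200x vs 300x figures from the post above.
example = {"scaf1": 200, "scaf2": 210, "scaf3": 300, "scaf4": 100, "scaf5": 195}
print(classify_by_coverage(example))
```

Note that the median is only a sensible baseline if most scaffolds sit at the "normal" depth. If a large part of the assembly is split into separate alleles, it would be safer to look at the whole depth histogram first and pick the baseline peak by eye.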
Are you sure the genome size is 13 Mb? I would be more inclined to believe the truth is around 20 Mb. I do not work with fungal genomes, but it seems quite unlikely for two haplotypes from the same strain to have 10-20% divergence. If this is really true, there is almost no way to tell segmental duplications from different alleles.
Your best hope is to sequence an inbred strain. Ploidy has caused quite a lot of problems for higher eukaryotic genomes (e.g. Ciona and zebrafish) and should be worse for fungi. This is a long-standing problem in de novo assembly. If there were a simple solution, those smart people would have found it.
That makes sense. On the other hand, I have not seen an assembler do that badly at estimating genome size. I once had Sanger data for a diploid fungal genome, and the size estimate was quite good. Of course, different clades may have completely different stories.
We have some close relatives sequenced, but there is very weak similarity at the nucleotide level. All closely related species from that clade have genomes in the range 12-15 Mb, which is why I suspect duplication.
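As a rough sanity check on the sizes quoted in this thread, the short Python sketch below estimates what the assembly would shrink to if one copy of every self-aligning region were removed. The function `collapsed_size_mb` is hypothetical, and the result depends on an assumption the thread does not settle: whether "1/3 of the genome aligns" counts only the redundant copy or both members of each aligned pair.

```python
def collapsed_size_mb(assembly_mb, aligned_fraction, counts_both_copies):
    """Estimate assembly size after keeping one copy of each aligned pair.

    assembly_mb: total assembly size in Mb (22-24 in this thread).
    aligned_fraction: share of the assembly that aligns to another scaffold.
    counts_both_copies: True if that share includes both members of each
        aligned pair, so only half of it is redundant (an assumption).
    """
    redundant = aligned_fraction / 2 if counts_both_copies else aligned_fraction
    return assembly_mb * (1 - redundant)


for size in (22, 24):
    for both in (False, True):
        estimate = collapsed_size_mb(size, 1 / 3, both)
        print(f"assembly {size} Mb, both copies counted: {both} -> {estimate:.1f} Mb")
```

Under the first reading the estimate lands a little above the 12-15 Mb range of the relatives; under the second it lands closer to the 20 Mb suggested above. So this arithmetic alone does not decide between the two views.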