Hi guys,
I am working with several diploid fungal genomes but am confused about how to deal with the duplicated genes. I started by assembling the diploid genomes using dipSPAdes, then did gene finding with MAKER. The reported number of genes is ~20,000, which is about twice as many as are reported for haploid genomes of the same genus. So my question is: is it okay to go forward with functional gene annotation, or do I need to somehow get rid of the duplicate genes in the genome? I have been confused about whether it is appropriate to publish the diploid version of the genome, or if it is necessary to report the haploid version. I hope this makes sense!
Thanks in advance, Morgan
Is the species you're working on (highly) heterozygous? If not, then dipSPAdes is unfortunately not the most appropriate choice of assembler.
I do not believe so (but I'm also not 100% sure), because they exhibit both haploid and diploid cells. I personally observed this under the microscope after staining with DAPI. Also, the diploid assembly was of much higher quality in terms of number of contigs, size, and N50 when I compared it to the regular SPAdes assembly. BUSCO confirmed that approximately 70% of the single-copy orthologs were duplicated. Do you recommend a certain assembler so that I could compare them?
It's true that it is not common to find diploid genome annotations in the databases. I don't think the EBI or NCBI submission pipeline will distinguish between a haploid and a diploid annotation, but you should contact them to find out the best way to submit your data. I'm looking forward to hearing more about it. One problem I can see is that the two alleles of a locus have two different gene identifiers in your MAKER annotation. That means you will have two locus identifiers for a single locus, which would be a bit weird.
Thanks for the advice, I'll contact the databases and can post an update here. I should have thought about this sooner before proceeding with assembly and annotation :/ I just wonder if there is a way to "fix" this with the gene predictions instead of having to start from the beginning with the assemblies.
If you know which contigs are part of which assembly (primary or secondary) then it's not a problem to filter your annotation.
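A minimal sketch of that filtering step in Python, assuming you already have a list of the contig IDs you want to keep and that your annotation is a standard tab-separated GFF3 file with the contig name in the first column. The function name and the contig names here are made up for illustration; they are not part of MAKER or dipSPAdes.

```python
def filter_gff3_by_contigs(gff_lines, keep_contigs):
    """Yield only the GFF3 lines that belong to contigs in keep_contigs.

    gff_lines: any iterable of lines (e.g. an open file handle).
    keep_contigs: set of contig IDs to retain (assumed to be known already).
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            # Some GFF3 files embed sequences after this line; stop here.
            break
        if line.startswith("##sequence-region"):
            # Keep region directives only for retained contigs.
            parts = line.split()
            if len(parts) >= 2 and parts[1] in keep_contigs:
                yield line
            continue
        if line.startswith("#"):
            yield line  # other header/comment lines pass through unchanged
            continue
        if not line.strip():
            continue  # skip blank lines
        seqid = line.split("\t", 1)[0]  # column 1 is the contig/sequence ID
        if seqid in keep_contigs:
            yield line


# Toy example with hypothetical contig names.
demo = [
    "##gff-version 3\n",
    "contig_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "contig_2\tmaker\tgene\t50\t700\t.\t-\t.\tID=gene2\n",
    "contig_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1\n",
]
kept = list(filter_gff3_by_contigs(demo, {"contig_1"}))
print("".join(kept), end="")
```

For a real file you would pass an open file handle instead of the list and write the kept lines to a new GFF3. Because whole contigs are kept or dropped, child features (mRNA, exon, CDS) stay with their parent gene.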
That is good to hear. Do you recommend any program that can do this? Would it basically be some sort of alignment program that can detect the duplicated genes?
Usually it is your assembler that would give you the phased genome. But I don't know what the dipSPAdes outputs look like.
Update:
I decided to go forward with haploid genomes instead, so I used the purge_haplotigs pipeline to remove the haplotigs. Both the assembly sizes and the number of annotated genes were greatly reduced.