Hi everyone,
I have assembled the genome of a non-model organism, an insect (de novo genome assembly), and mapped the reads back to it. I am currently performing a variant analysis. Across all the variants detected (SNPs and InDels), the initial results show that 98% are heterozygous (i.e. more than one allele was called at that position) and 2% are homozygous (i.e. only one allele was called at that position).
My question is: is it plausible to have homozygous variants at all, given that it's the same species and the same set of reads was used both to build the assembly and for the read mapping? Or is it an error that homozygous variants are being called? I am using CLC Genomics Workbench v9 to call the variants.
Please advise. Thank you.
So you have assembled the genome of this insect de novo yourself, which suggests it is the first genome of this insect. What are you calling variants against? If this is the first genome, there is no independent reference, so there aren't any variations to call.
If you're mapping your reads back to your own de novo assembly and calling indels/mutations from that, then the homozygous calls are essentially de novo assembly errors: every read at that position disagrees with the base the assembler put in the consensus.
If you are comparing your de novo assembly against an existing genome of the insect, then the homozygous calls represent natural variation. Just because it is the same species doesn't mean it will give exactly the same genome, e.g. the human 1000 Genomes Project.
As suggested by Tonor, homozygous calls are an inconsistency between your assembly and variant calling, likely suggesting errors in one of those.
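To follow up on those homozygous calls, it may help to pull them out and inspect them directly. Below is a minimal sketch, assuming the variant calls have been exported to a single-sample VCF file with a GT field (that export step is an assumption, it is not something described in the thread). It counts heterozygous versus homozygous-alternate calls and lists the homozygous-alternate positions, which would be the candidate assembly errors when reads are mapped back to their own assembly. The function names and the toy records are made up for illustration.

```python
from collections import Counter

def classify_genotype(gt):
    """Classify a VCF GT string such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"

def summarize_vcf(lines):
    """Count genotype classes for the first sample in a VCF.

    Returns (counts, hom_alt_sites), where hom_alt_sites is a list of
    (contig, position) tuples for homozygous-alternate calls.
    """
    counts = Counter()
    hom_alt_sites = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no genotype
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt = fields[9].split(":")[format_keys.index("GT")]
        kind = classify_genotype(gt)
        counts[kind] += 1
        if kind == "hom_alt":
            hom_alt_sites.append((fields[0], int(fields[1])))
    return counts, hom_alt_sites

# Hypothetical toy records, not real data from the thread.
TOY_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "contig1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",
    "contig1\t250\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1/1:28",
    "contig2\t40\t.\tG\tGA\t45\tPASS\t.\tGT:DP\t0/1:25",
]

if __name__ == "__main__":
    counts, sites = summarize_vcf(TOY_VCF)
    print(dict(counts))
    print(sites)
```

For a real file you could pass an open file handle instead of the toy list, e.g. `summarize_vcf(open("calls.vcf"))` (the file name is a placeholder). Looking at the read pileup around each listed position should show whether the assembly or the variant call is the one at fault.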