What's a good method to distinguish segmental duplications from high levels of heterozygosity in a genome assembly? I've mapped my reads back to the genomic contigs (unscaffolded) and plotted the distribution of per-contig coverage:
This shows that a portion of the contigs (~1/4 of the assembly) has double the coverage (100X) of the other portion (50X). I interpret this as either:
- The contigs at double the coverage are duplicated in the genome. I guess this would have to be a fairly recent duplication, since the two copies must still be similar enough for the assembler to collapse them into a single contig?
- The contigs at half the coverage come from regions heterozygous enough that the two haplotypes assembled separately, which would mean most of the genome is highly heterozygous.
I guess it could also be a mix of both cases.
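For reference, this is roughly how the per-contig coverage distribution was computed; a minimal sketch assuming a sorted, indexed BAM of the reads mapped back to the contigs (`mapped.bam` and the 75X cutoff are placeholders, not my real values):

```python
# Sketch: mean per-contig coverage from a sorted, indexed BAM.
import pysam

bam = pysam.AlignmentFile("mapped.bam", "rb")
cov = {}
for contig, length in zip(bam.references, bam.lengths):
    # count_coverage returns four per-base arrays (one each for A, C, G, T)
    per_base = bam.count_coverage(contig)
    cov[contig] = sum(sum(arr) for arr in per_base) / length

low = {c: d for c, d in cov.items() if d < 75}    # ~50X population
high = {c: d for c, d in cov.items() if d >= 75}  # ~100X population
print(len(low), "contigs near 50X,", len(high), "contigs near 100X")
```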
The contig assembly was performed with ABySS. It's a diploid arthropod with a genome size estimated experimentally at 3.6 Gb.
It was suggested to me to look at the expected coverage of the genome: if it were a duplication event, the expected coverage would be ~50X, whereas if the genome were highly heterozygous, it would be ~100X. But how would I calculate an expected coverage?
Technically, if I just take the summed length of all trimmed/clipped read data and divide by the estimated genome size of 3.6 Gb, I get ~110X coverage. However, is using the summed read length valid? It doesn't account for bad reads (sequencing errors) or highly repetitive regions, both of which would confound the coverage calculation.
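Concretely, the 110X figure comes from something like this (a rough sketch; `reads.fastq` stands in for the actual trimmed libraries):

```python
# Rough raw coverage: total trimmed/clipped read bases over the
# experimentally estimated genome size. "reads.fastq" is a placeholder.
genome_size = 3.6e9  # experimental estimate, in bp
total_bases = 0
with open("reads.fastq") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 1:  # the sequence line of each 4-line FASTQ record
            total_bases += len(line.strip())

print("raw coverage: %.1fX" % (total_bases / genome_size))  # ~110X here
```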
However, I guess the fact that the raw coverage is 110X suggests that the genome is highly heterozygous? If it were segmental duplications, we would expect the raw coverage to be only a bit higher than 50X, and the amount of bad and repetitive reads is not so high that a corrected coverage would fall all the way from 110X to ~50X. The average coverage of k-mers from the k-mer spectrum (discarding very low- and very high-multiplicity k-mers) is ~83X.
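The ~83X number was read off the k-mer spectrum roughly like this; a sketch assuming the two-column output of `jellyfish histo`, with arbitrary placeholder cutoffs:

```python
# Count-weighted mean k-mer multiplicity, ignoring the error peak at
# low multiplicity and the repeat tail at high multiplicity.
low_cut, high_cut = 10, 500  # cutoffs are placeholders
weighted = total = 0
with open("kmers.histo") as fh:  # `jellyfish histo` output: "mult count"
    for line in fh:
        mult, count = map(int, line.split())
        if low_cut <= mult <= high_cut:
            weighted += mult * count
            total += count

print("mean k-mer coverage: %.1fX" % (weighted / total))  # ~83X here
```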
Another issue is that the total length of the contig assembly is ~2.9 Gb. The 100X contigs sum to ~750 Mb and the 50X contigs to ~2 Gb. If it were segmental duplications, the collapsed 100X contigs would count twice toward the genome, giving 2 Gb + (750 Mb * 2) = 3.5 Gb, which matches our estimated genome size nicely.
If it were high heterozygosity, the haplotype-split 50X contigs would count half, giving 2 Gb / 2 + 750 Mb = 1.75 Gb, which would mean our sequencing libraries did not cover the genome evenly; for whatever reason, they would have covered only half of it.
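To make the bookkeeping explicit, here are the two scenarios as a worked calculation, using the rounded contig totals from above:

```python
# Implied haploid genome size under each interpretation, using the
# rounded totals from this assembly (~750 Mb at 100X, ~2 Gb at 50X).
high_cov = 750e6  # contigs at ~100X
low_cov = 2.0e9   # contigs at ~50X

# Segmental duplication: 100X contigs are collapsed two-copy regions,
# so they count twice toward the genome.
dup_size = low_cov + 2 * high_cov  # = 3.5 Gb, close to the 3.6 Gb estimate

# High heterozygosity: 50X contigs are haplotype-split, so they count
# half; 100X contigs are the collapsed homozygous fraction.
het_size = low_cov / 2 + high_cov  # = 1.75 Gb, far from 3.6 Gb

print("duplication scenario:    %.2f Gb" % (dup_size / 1e9))
print("heterozygosity scenario: %.2f Gb" % (het_size / 1e9))
```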