A biologist once asked me whether it is possible to detect polyploidy from an assembly. I thought not, because duplicate genomic regions are merged in an assembly.
Is this reasoning correct, or is detection possible after all? If so, are there any tools?
One way to test for polyploidy is probably to realign the raw reads against the assembly and count the read frequencies of variants.
In a diploid genome, we expect to find predominantly variants supported by 50% of the reads (two alleles, heterozygosity). In a triploid genome, we expect to find variants supported by 33% and 66% of the reads (three alleles). Tetraploid: 25%, 50% and 75%. And so on.
I expect many caveats with this approach, though. First, it assumes that all polyploid chromosomes are truly collapsed into a single chromosome in your assembly and not assembled as separate contigs. Second, it requires very high read coverage across the genome to call low-frequency variants. Third, read coverage will fluctuate, making reliable estimates of read frequencies difficult. Fourth, variant mis-calls and duplicated regions will complicate the picture.
However, it may be possible to pool the genome-wide evidence from many thousands of variants to come up with the most likely ploidy of the sequenced genome.
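To make the pooling idea concrete, here is a minimal sketch in Python. It is not an established tool, just an illustration under stated assumptions: each heterozygous variant is given as a hypothetical `(alt_reads, total_reads)` pair, every candidate ploidy p is modelled as an equal-weight mixture of binomials with allele fractions 1/p ... (p-1)/p, and coverage bias, mis-calls and uncollapsed haplotypes are ignored. The function names are made up for this example.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k alt reads out of n at allele fraction p."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1 - p)

def ploidy_log_likelihood(counts, ploidy):
    """Pooled log-likelihood of all variants under one candidate ploidy.

    Assumption: each possible allele dosage (1 .. ploidy-1 copies) is
    equally likely, so the mixture weights are uniform.
    """
    fractions = [dose / ploidy for dose in range(1, ploidy)]
    total = 0.0
    for alt, depth in counts:
        comps = [log_binom_pmf(alt, depth, f) for f in fractions]
        top = max(comps)
        # log-sum-exp over mixture components, then apply the uniform weight
        total += top + math.log(sum(math.exp(c - top) for c in comps))
        total -= math.log(len(fractions))
    return total

def most_likely_ploidy(counts, candidates=(2, 3, 4)):
    """Return the best-scoring candidate ploidy and all scores."""
    scores = {p: ploidy_log_likelihood(counts, p) for p in candidates}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    # Made-up read counts clustering around 1/3 and 2/3
    toy_counts = [(33, 100), (67, 100), (34, 100), (66, 100)]
    best, scores = most_likely_ploidy(toy_counts)
    print(best, scores)
```

The uniform mixture weights act as a mild penalty on higher ploidies: a model with more expected peaks spreads its probability over more allele fractions, so it does not win automatically just because it has more peaks to fit.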
Hi,
if you still have the raw reads (before mapping) or the mapped reads, you can try to identify duplicated genomic regions by detecting regions with significantly higher coverage. If your global coverage is n and some regions have a coverage of (theoretically) 2n, this might indicate that those regions are duplicated. Unfortunately I don't have an obvious reference to share (these are just memories from a presentation), but you can check what is done to detect CNVs, for example (even though other methods are more widely used there).
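A rough sketch of the coverage idea in Python, assuming you already have mean read depth per fixed-size window (for example computed from the mapped reads). The function name, the `fold` and `tolerance` parameters, and the toy numbers are all hypothetical; real CNV callers also correct for GC content and mappability, which this does not.

```python
from statistics import median

def flag_high_coverage_windows(window_depths, fold=2.0, tolerance=0.25):
    """Flag windows whose depth is close to `fold` times the global median.

    window_depths: mean read depth per window, in genome order.
    Returns the baseline depth and a list of (window_index, depth_ratio).
    """
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    flagged = []
    for index, depth in enumerate(window_depths):
        ratio = depth / baseline
        # keep windows whose ratio to the baseline is near the expected fold
        if abs(ratio - fold) <= tolerance:
            flagged.append((index, round(ratio, 2)))
    return baseline, flagged

if __name__ == "__main__":
    # Made-up depths: two windows at roughly twice the typical coverage
    depths = [30, 31, 29, 60, 62, 30, 28, 30]
    print(flag_high_coverage_windows(depths))
```

The median is used as the baseline n because it is less sensitive than the mean to a few very deep windows such as collapsed repeats.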
I hope it has been helpful.
Addition: from the comments below it seems the method is non-trivial and might not be the most suitable one.
I agree with you (and with Casey Bergman's post, which raised the same concern). If we consider polyploidy as the presence of supernumerary chromosomes (the actual definition), this approach won't work. But the question mentions "duplicate genomic regions", which motivated me to explain how such regions could be identified. I could have been more precise.
Thanks for your various inputs. As I mentioned, I have only heard about such methods and have no experience with them. Reading the comments from more experienced people, it seems this is not trivial, and other methods or additional experimental work might be more suitable. I'll update my first post accordingly.
Interesting question. In theory, it is not possible to detect a recent, complete auto-polyploid genome from a WGS assembly, since the copy number of all chromosomes would scale perfectly with ploidy. That is, if all regions of the genome in a polyploid are the same (i.e. no sequence variation among homologous chromosomes), you can't tell if the genome is 1C, 2C, 4C, etc.
However, for an allo-polyploid genome or for partial (auto- or allo-) polyploidy that is not complete across the genome, then it should be possible to detect the polyploidy from assembly of divergent haplotypes or regional differences in read depth as noted by Phillipe.
In plants, there is work suggesting that polyploidization is accompanied by rapid accumulation of mutations. So it should be possible to find multiple alleles, i.e. heterozygosity for SNPs and indels.
Look up Avi Levy's work from the Weizmann Institute of Science (he worked on wheat).
Eitan Rubin
I don't agree with Casey Bergman that "it is not possible to detect a recent, complete auto-polyploid genome from a WGS assembly". Say your (duplicated) genome is 1 Gb in size and you sequence it to 100x coverage, so 100 Gb of reads. After assembly, for a non-duplicated genome, you would expect the assembly size to be approximately 1 Gb, with an average coverage of the non-repetitive parts of approximately 100x. If the genome were instead a complete auto-polyploid (all chromosomes duplicated), the duplicate chromosomes would collapse during assembly, so you would see something like a 0.5 Gb total assembly size with an average coverage of the non-repetitive parts of 200x. This is of course an ideal situation, but you get the idea.
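The arithmetic in this example can be written down as a small Python sketch. It assumes the true genome size is known independently (for instance from flow cytometry), and it uses the idealised numbers from the post; the function name is made up for illustration.

```python
def collapse_factor(total_sequenced_bases, assembly_size, known_genome_size):
    """Ratio of observed to expected coverage.

    A value near 1 suggests no collapse; a value near 2 suggests that two
    copies of each region were merged into one in the assembly.
    """
    expected_coverage = total_sequenced_bases / known_genome_size
    observed_coverage = total_sequenced_bases / assembly_size
    return observed_coverage / expected_coverage

if __name__ == "__main__":
    # Numbers from the example: 100 Gb of reads, 1 Gb genome, 0.5 Gb assembly
    factor = collapse_factor(
        total_sequenced_bases=100_000_000_000,
        assembly_size=500_000_000,
        known_genome_size=1_000_000_000,
    )
    print(factor)
```

Note that the sequenced total cancels out, so the factor reduces to known genome size divided by assembly size. The estimate is therefore only as good as the independent genome size measurement.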
This requires knowledge about the genome size, which was not stated in the question. Clearly with additional information (e.g. a reference genome, knowledge of genome size, cytological information), ploidy can be estimated. Though I would not find overall fold-changes in depth of coverage convincing evidence since the expected throughput of a WGS experiment is not the observed throughput and you could make false inferences with this approach.
This is a clever idea and could probably be applied to some recently derived auto-polyploids.
That's a brilliant idea! I first saw it applied in Yoshida et al. 2013 (see Figure 9). We used it in our analysis of Candida orthopsilosis hybrids (Figure S5). What's cool is that we were able to detect copy number variations of individual chromosomes or even chromosomal arms (C. metapsilosis, submitted)!
Have any new methods come up?