Hello,
I am working on a single-celled organism that I am isolating from a natural environment and sequencing with Illumina paired-end (PE) reads.
How can I determine its ploidy? There is no reference genome.
I was thinking of mapping the reads to a de novo assembly and checking the maximum number of alleles I can find per locus, and their frequencies.
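As a rough illustration of the per-locus allele counting described above, here is a minimal Python sketch. It assumes the bases covering one locus have already been extracted from the alignment as a plain string; the function name `allele_frequencies` and the `min_fraction` cutoff are hypothetical, and the data are made up.

```python
from collections import Counter

def allele_frequencies(pileup_bases, min_fraction=0.1):
    """Return {base: fraction} for the bases observed at one locus.

    pileup_bases: read bases covering the locus, e.g. "AAAACCA".
    min_fraction: hypothetical cutoff; rarer bases are treated as likely errors.
    """
    counts = Counter(b for b in pileup_bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {b: n / depth for b, n in counts.items() if n / depth >= min_fraction}

# Made-up locus: 15 reads with A, 5 with C, 1 with T (probably an error).
site = "A" * 15 + "C" * 5 + "T"
freqs = allele_frequencies(site)
print(len(freqs), freqs)
```

The number of alleles kept per locus, and how their fractions cluster across many loci, is what would then be compared against the fractions expected for a given ploidy.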
Adrian
Sounds reasonable. Maybe you can do this even faster with a k-mer-based approach, provided you find a way to differentiate between alleles and sequencing errors.
This is an excellent idea, and I have already tried it :) I built a k-mer graph and I see 3 peaks. The last peak is the broadest, the second peak is at half the coverage of the last, and the first peak is at half the coverage of the second. This suggests that the allele frequencies in the data set are 0.25, 0.50, and 1.00.
This potentially suggests the organism is tetraploid. However, I am missing a peak at 0.75; I believe that because the 1.00 peak is so broad, it is hiding the 0.75 peak.
Do these conclusions sound correct?
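One way to sanity-check this reasoning is to list the allele fractions i/n expected for each candidate ploidy n and see which ploidies can account for every observed peak. The sketch below does only that arithmetic; the function names, the tolerance `tol`, and the upper limit `max_ploidy` are assumptions chosen for illustration, and the observed ratios are the ones quoted above.

```python
def expected_fractions(ploidy):
    """Fractions i/ploidy for a variant present on i of the genome copies."""
    return [i / ploidy for i in range(1, ploidy + 1)]

def consistent_ploidies(observed, max_ploidy=8, tol=0.03):
    """Ploidies for which every observed peak ratio is within tol of some i/ploidy."""
    hits = []
    for n in range(1, max_ploidy + 1):
        expected = expected_fractions(n)
        if all(any(abs(o - e) <= tol for e in expected) for o in observed):
            hits.append(n)
    return hits

# Peak coverage relative to the highest-coverage peak, as described in the post.
observed = [0.25, 0.50, 1.00]
candidates = consistent_ploidies(observed)
print(candidates)  # [4, 8]
```

With these ratios, 4 is the smallest ploidy that fits, and it predicts an additional peak at 0.75 that is not seen. Multiples of 4 fit as well, so the peak ratios alone cannot rule them out.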
As for sequencing errors: this is Illumina, and the run was of high quality, so I suppose the errors would simply contribute to the bell curve in my peaks. Any other suggestions? I suppose I could filter reads based on quality, but I have heard people warn against this, since it introduces bias.
Your conclusion sounds reasonable to me, although I am not an expert on interpreting these peaks. With respect to the sequencing errors, I think you are also right. To clean up your data, you could also just throw away all k-mers that occur only a few times.
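To make the low-count filter concrete, here is a minimal pure-Python sketch. It is a toy under stated assumptions: reads are plain strings, k-mers are not merged with their reverse complements, and the `min_count` threshold of 3 is arbitrary. Real read sets would need a dedicated k-mer counter rather than an in-memory dictionary.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer in the reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts

def drop_rare(counts, min_count=3):
    """Keep only k-mers seen at least min_count times (hypothetical threshold)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

# Toy data: three identical reads plus one read with a single-base error.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
kept = drop_rare(count_kmers(reads, k=4), min_count=3)
print(sorted(kept))  # ['ACGT', 'CGTA', 'GTAC']
```

The two k-mers created by the error (CGTT and GTTC) occur once each and are dropped, while the k-mers shared by the correct reads survive.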
Hi Adrian, I am also using a k-mer strategy to determine ploidy. Can you tell me more about the tool you used and what you did with the output, please?
Couldn't one of the additional peaks be an organellar genome?
Good point, but I work on an organism that does not have an organellar genome. The peaks could also originate from contaminants such as bacteria.
That is why I mapped my reads with bwa to my draft assembly, restricted to contigs I am certain come from the correct nuclear genome, and used only the reads that mapped to construct the k-mer graph.
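A minimal sketch of that read-selection step, assuming the alignments are available as plain-text SAM lines: the function name `reads_on_trusted_contigs` and the contig names are hypothetical, and only the standard SAM columns (read name, flag, reference name) and the 0x4 "unmapped" flag bit are used.

```python
def reads_on_trusted_contigs(sam_lines, trusted):
    """Return names of reads aligned to any contig listed in `trusted`."""
    keep = set()
    for line in sam_lines:
        if line.startswith("@"):      # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, contig = fields[0], int(fields[1]), fields[2]
        if flag & 0x4:                # read is unmapped
            continue
        if contig in trusted:
            keep.add(name)
    return keep

# Toy SAM records (trailing columns shortened; contig names are made up).
sam = [
    "@SQ\tSN:contig_1\tLN:5000",
    "read1\t0\tcontig_1\t100\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII",
    "read2\t0\tcontig_9\t200\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTAC\tIIIIII",
]
selected = reads_on_trusted_contigs(sam, trusted={"contig_1"})
print(sorted(selected))  # ['read1']
```

Only the selected reads would then go into the k-mer counting, so that contaminant sequence does not add peaks of its own.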