Question

Determining ploidy via read ratios with HaplotypeCaller: skewed results?

0

Entering edit mode

8.1 years ago

mmats010 ▴ 80

Hello, I am attempting to determine the ploidy of my organism by observing where the modes of SNP allele ratios fall. For example, a 2N diploid should have a peak of allele read-counts at 0.5 since most SNPs should have, for example, 15 reference and 15 alternate reads. A triploid should have two modal peaks around 0.33 and 0.66, and a tetraploid should have four peaks around 0.25, 0.5, and 0.75.

I am using HaplotypeCaller to call SNPs across ~50 individuals as a cohort in gVCF+joint genotyping mode, then using SelectVariants to keep biallelic SNPs and using the JEXL expression " -select 'vc.getGenotype("sample").isHet()' " to isolate only those SNPs called as heterozygous, then taking the "-GF AD" option on SelectVariants and parsing the allele depths in the resulting table, and finally calculating the alternate allele read depth divided by the total read depth to get my final "ratio". Each individual has >200,000 SNPs included in analysis.

While this workflow seems to work and give a modal peak around 0.5, every sample previously known as a 2N diploid I try this on shows a significant shoulder on the left side of the density plot I generate (see attached, sample name = "1306"). When I apply this to an artificial tetraploid (two 2N diploid individuals merged) I can see the expected three modal peaks around 0.25, 0.5, and 0.75, but again, there is a very significant bias toward the left. The same trend finally applies to an isolate known to be triploid (sample "EC1"). In essence, it appears that the number of reference reads is crowding out the alt reads.

My first theory was that local CNVs drove up reference read alleles, and indeed the SNPs showing this abnormally low read ratio values were usually higher in read depth than the high ratio SNPs. Removing SNPs with high read depths (>200x in a genome with ~50x background) didn't really change anything. My next theory is that reads with alt alleles weren't mapped as well as ref reads, leading HaplotypeCaller to discard them due to low mapping quality.

Is this a problem that anyone employing this this technique has run into? Is HaplotypeCaller just not a good program to use? Should I forgo cohort SNP calling in HC, or just use UnifiedGenotyper? Do I need more sophisticated filtering parameters (I used GATK Best practices recommended for hard filtering, I can't use VQSR)?

TL;DR, too many read ratio peak heights in 2N individuals, and not equal heights in higher ploidies. Why??

Normal diploid Artificial tetraploid Normal triploid

SNP gatk ploidy illumina R • 3.0k views

ADD COMMENT • link updated 5.3 years ago by Biostar 20 • written 8.1 years ago by mmats010 ▴ 80

0

Entering edit mode

Hello mmats, I am currently following the same method to determine ploidy. What kind of organism is it?

Have you tried to discard possible paralogs and repetitive sequences? What I did, I have selected exonic sequences which are considered as unique in my reference genome in order to avoid bias from possible copies. To do it, I did a blastp of all the predicted protein sequences against themselves. Then, I have only selected sequences which have only a significant alignment with itself (e-value = 0), the second alignment, in the output, for each sequence, was not significant. Thus, I assume that the selected sequences are 'unique' regarding the method.

In my case, I do not have a population but only the reads used for the assembly. Then, I assumed that I have to count frequencies for reference allele. Did you use the value given by the AD field and DP field to calculate the 'final ratio'?

ADD REPLY • link 7.7 years ago by dilution • 0