Question

mostly heterozygous SNP

8.2 years ago

selias ▴ 40

Hi, I have resequencing data for a plant genome. After variant calling with GATK, most of the SNPs found are heterozygous. Can anyone point out whether something is going wrong? Is it possible for almost all of the SNPs to be het?

Thanks!

Update: when I look at them in IGV, I can see that one of the alleles is at, e.g., 30% and the other at 70%. How can I extract only the higher-percentage allele as the genotype?

SNP • 3.7k views

What's the ploidy? Did you use the ploidy option of HaplotypeCaller?

ADD REPLY • link 8.2 years ago by Santosh Anand 5.8k

This is a diploid rice genome. I used GATK HaplotypeCaller without specifying any ploidy. Following the same procedure and commands, I called SNPs in two more rice genomes, and they have both homozygous and heterozygous SNPs, but in this particular genome most of the variants are heterozygous.

ADD REPLY • link 8.2 years ago by selias ▴ 40

"When I look at them in IGV, I can see that one of the alleles is at, e.g., 30% and the other at 70%. How can I extract only the higher-percentage allele as the genotype?"

Are you sure you want to do that? If that many reads support the ALT allele, a het genotype is far more likely than a hom one. Moreover, if the same protocol works well with the other two rice genomes, what is your reason to believe that it is doing something wrong here?

You may also check the PL and GQ fields of your sample (see the link below) to be more sure about the quality of the genotype call.

http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it

ADD REPLY • link 8.2 years ago by Santosh Anand 5.8k
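
As a rough illustration of checking those fields, here is a minimal Python sketch that splits one sample column of a VCF record by its FORMAT keys and derives the ALT allele fraction from AD. The record below is hypothetical: the PL values are the ones quoted later in this thread, while the AD and DP values are made up to mirror the 70%/30% split seen in IGV.

```python
def parse_sample(format_field, sample_field):
    """Map FORMAT keys (e.g. GT:AD:DP:GQ:PL) to the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

# Hypothetical FORMAT and sample columns (AD and DP are invented for illustration).
fmt = "GT:AD:DP:GQ:PL"
sample = "0/1:70,30:100:99:3048,0,5339"

call = parse_sample(fmt, sample)

# AD holds per-allele read depths: reference first, then each ALT allele.
ref_depth, alt_depth = (int(x) for x in call["AD"].split(","))
alt_fraction = alt_depth / (ref_depth + alt_depth)

# PL holds one Phred-scaled likelihood per genotype: 0/0, 0/1, 1/1 for a biallelic site.
pls = [int(x) for x in call["PL"].split(",")]

print("GT:", call["GT"], "GQ:", call["GQ"], "PL:", pls, "ALT fraction:", alt_fraction)
```

For real files a dedicated VCF parser would be a safer choice than splitting strings by hand; this only shows where the numbers live.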

Thanks very much for the reply. I was just thinking of doing that to see how the result comes out. Yes, the same protocol worked very well with the other genomes. The unusually high percentage of hets makes me doubt the result, but otherwise it seems clean. I am just wondering what it is trying to tell me. I have checked the GQ and PL values. GQ is 99 for many, and PL is, for example, 3048, 0, 5339. Thanks for your suggestion.

ADD REPLY • link 8.2 years ago by selias ▴ 40

Your GQ and PL both say that the software is extremely confident in calling the (het) genotype. The highest GQ value that GATK assigns is 99. See the link above for a description.

ADD REPLY • link 8.2 years ago by Santosh Anand 5.8k

Thanks, Santosh. Yes, I can see that. In some older builds of GATK, people reported bugs that produced an abnormal excess of heterozygotes, but those should not be an issue in the newer ones. I assume that if the PL values were 400, 0, 5339, then we could have doubts about that particular call?

ADD REPLY • link 8.2 years ago by selias ▴ 40

PL values are normalized so that the most probable genotype gets a score of zero (the het here). The next one (hom ref) is quite far away at 400 on the Phred scale, i.e. about 10^-40 times as likely as the het, which means the genotype is almost certainly het.

ADD REPLY • link 8.2 years ago by Santosh Anand 5.8k
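
To make the Phred arithmetic concrete, here is a small Python sketch. It assumes the usual convention that PL = -10 * log10(likelihood relative to the best genotype) and that GQ is the gap between the two smallest PLs, capped at 99. The function names are illustrative, not part of any GATK API.

```python
def pl_to_log10_relative_likelihoods(pls):
    """Convert Phred-scaled PLs to log10 likelihoods relative to the best genotype."""
    return [-pl / 10 for pl in pls]

def gq_from_pl(pls, cap=99):
    """Genotype quality as the gap between the two smallest PLs, capped at `cap`."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)

# PL triplets discussed in the thread, in the order hom ref, het, hom alt.
for pls in ([3048, 0, 5339], [400, 0, 5339]):
    log10_rel = pl_to_log10_relative_likelihoods(pls)
    print(pls, "GQ:", gq_from_pl(pls), "log10 relative likelihoods:", log10_rel)
```

Under these assumptions both triplets give a GQ of 99. A PL of 400 still puts hom ref 40 orders of magnitude below the het, so doubt would only start when the second-smallest PL drops to the low tens.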

You mentioned three different experiments, of which only one shows most of the SNPs as hets. Were all three experiments done using the same chemistry and sequencing machine? Do you have any information about these particular samples? Are you sure these three rice samples are from the same subspecies and should be used with the same reference? Which reference are you using?

ADD REPLY • link 8.2 years ago by Petr Ponomarenko ★ 2.8k

Hi Petr. Yes, these were done using the same chemistry and sequencing machine. Apparently they are from the same subspecies, but there is previous evidence that the particular one I am talking about falls in a different subclade. Can that be a reason? And yes, using the same reference is not logical. I am using the Nipponbare reference genome because it is well annotated.

ADD REPLY • link 8.2 years ago by selias ▴ 40

Yes, it may be the reason if the two other samples are Nipponbare and the heterozygous one is not. Do you have approximately the same number of SNPs called in all three samples, and are the coverages similar?

ADD REPLY • link 8.2 years ago by Petr Ponomarenko ★ 2.8k
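
One way to put numbers on that comparison is to tally genotype classes per sample. The Python sketch below is a minimal illustration that reads VCF lines as plain text; the toy records and sample names are invented, and a multi-sample VCF with a GT key in FORMAT is assumed.

```python
from collections import Counter

def classify_genotype(gt):
    """Classify a GT string such as 0/1 or 1|1 as het, hom_ref, hom_alt or missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"

def count_genotypes(vcf_lines):
    """Return {sample_name: Counter of genotype classes} for an iterable of VCF lines."""
    samples, counts = [], {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            counts = {name: Counter() for name in samples}
            continue
        gt_index = fields[8].split(":").index("GT")
        for name, sample_field in zip(samples, fields[9:]):
            counts[name][classify_genotype(sample_field.split(":")[gt_index])] += 1
    return counts

# Toy input with invented sample names and records.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t0/1:99\t1/1:99",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:GQ\t0/1:99\t0/0:60",
    "chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT:GQ\t0/1:99\t./.:.",
]

counts = count_genotypes(toy_vcf)
for name, tally in counts.items():
    print(name, dict(tally))
```

With a real file, `count_genotypes(open(path))` would work for uncompressed VCFs, where `path` is whatever file you called variants into.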

Thank you, Petr. The other two samples are not Nipponbare-related either; they are also from a different sub-subclade than Nipponbare. The SNP calls and coverage do not differ much, other than in the numbers of hets and homs.

ADD REPLY • link 8.2 years ago by selias ▴ 40

Hm... Could you please tell us more about these subclades? Do you think the het one is closer to species found in the wild, while the hom samples are from a clade that has undergone more human selection and cultivation? In other words, can you find any logical explanation for the het/hom difference between these three?

ADD REPLY • link 8.2 years ago by Petr Ponomarenko ★ 2.8k

Yes, that can be a reason. The other two have undergone more cultivation, but the het one was newly collected from farmers. All of them are indica subspecies, but in phylogenetic analysis the het one falls away from the other two. Still, the high percentage of heterozygosity seems so unreal.

ADD REPLY • link 8.2 years ago by selias ▴ 40

For old, stable varieties, a very high hom percentage is expected, while for a variety from the wild, or one from a very recent cross between different subspecies, a higher het percentage is expected. Do you get consistent results with an indica reference?

ADD REPLY • link 8.2 years ago by Petr Ponomarenko ★ 2.8k

Also, could you tell us more about sample collection, sample preparation, library prep and so on in all three cases?

ADD REPLY • link 8.2 years ago by Petr Ponomarenko ★ 2.8k

Seeds were grown under greenhouse conditions, and DNA was then collected from seedling-stage rice plants. Multiple plants were pooled to get the DNA. Library preparation and sequencing were done commercially, using Illumina chemistry. Libraries with four different insert sizes were prepared. The protocol was the same for all samples.

ADD REPLY • link 8.2 years ago by selias ▴ 40
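
Given the pooled DNA and the roughly 30%/70% allele split reported in IGV, one possible check is whether read counts at het sites fit the approximately 50:50 ratio expected from a single diploid heterozygote. The Python sketch below runs an exact two-sided binomial test using only the standard library. The depth of 100 and the 30 ALT reads are hypothetical numbers chosen to mirror that split, and the test ignores reference bias and mapping artifacts.

```python
from math import comb

def binomial_two_sided_p(alt_reads, depth, expected_fraction=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than the
    observed ALT read count under the expected allele fraction.
    """
    probs = [
        comb(depth, k) * expected_fraction**k * (1 - expected_fraction) ** (depth - k)
        for k in range(depth + 1)
    ]
    observed = probs[alt_reads]
    # Small tolerance so outcomes tied with the observed one are included.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))

# Hypothetical site: 30 ALT reads out of 100, versus a balanced 50 out of 100.
p_skewed = binomial_two_sided_p(30, 100)
p_balanced = binomial_two_sided_p(50, 100)
print("30/100:", p_skewed)
print("50/100:", p_balanced)
```

A consistently small p-value across many sites would suggest the allele ratios are not those of one diploid individual, which might fit a pool of plants that differ from each other, though other causes such as mis-mapped reads cannot be ruled out from this alone.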

I am actually aligning it to indica now, and it has not finished yet. Will update.

ADD REPLY • link 8.2 years ago by selias ▴ 40

Could this be contamination? Perhaps try to validate a few variants (using Sanger sequencing) from a fresh sample.

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

Thanks for your reply. I was quite sure it is not contamination, as the samples were prepared carefully. I also tried running some allele-specific markers on a gel, and they showed two bands as well. Thanks.

ADD REPLY • link 8.2 years ago by selias ▴ 40