7.7 years ago
selias
Hi, I have resequencing data for a plant genome. After variant calling with GATK, most of the SNPs found are heterozygous. Can anyone point out whether something is going wrong? Is it possible for almost all the SNPs to be HET?
Thanks!
Update: when I view them in IGV, I can see that one allele is at, e.g., 30% and the other at 70%. How can I extract only the higher-percentage allele as the genotype?
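The 30%/70% split seen in IGV can also be read straight from the VCF. Below is a minimal Python sketch that computes per-allele read fractions from the AD (allelic depth) entry of one sample column. It assumes the VCF carries an AD field in FORMAT, as GATK output normally does; the function name and the example values are made up for illustration. It only reports the fractions and does not change the genotype.

```python
def allele_fractions(format_keys, sample_field):
    """Return per-allele read fractions from the AD entry of one VCF sample column."""
    # Pair FORMAT keys (e.g. "GT:AD:DP") with the sample's values (e.g. "0/1:30,70:100")
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    # AD lists read depths in the order REF, ALT1, ALT2, ...; skip missing entries
    depths = [int(d) for d in values["AD"].split(",") if d != "."]
    total = sum(depths)
    if total == 0:
        return []
    return [d / total for d in depths]


# Hypothetical het call with 30 REF reads and 70 ALT reads
print(allele_fractions("GT:AD:DP", "0/1:30,70:100"))  # [0.3, 0.7]
```

Sites where the fractions are consistently skewed away from 50/50 are worth a closer look before trusting the het call.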
What's the ploidy? Did you use the ploidy option of HaplotypeCaller?
This is a diploid rice genome. I used GATK HaplotypeCaller without specifying any ploidy. Following the same procedure and commands, I called SNPs in two more rice genomes, and they have both homozygous and heterozygous SNPs, but in this particular genome most of the variants are heterozygous.
Are you sure you want to do that? I mean, if there are so many reads supporting both alleles, a het is far more probable than a hom. Moreover, if the same protocol works well with the other two rice genomes, what is your reason to believe that it is doing something wrong here?
You may also check the PL and GQ fields of your sample (see the link below) to be more sure about the quality of the genotype call.
http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it
Thanks very much for the reply. I was just thinking of doing that to see how the result comes out. Yes, the same protocol worked very well with the other genomes. The unusually high percentage of hets is making me doubt the result, but otherwise it seems clean. I am just wondering what it is trying to tell me. I have checked the GQ and PL values: GQ is 99 for many, and PL is, for example, 3048, 0, 5339. Thanks for your suggestion.
Your GQ and PL both say that the software has essentially absolute confidence in calling the (het) genotype. The highest value of GQ that GATK assigns is 99. See the link above for a description.
Thanks Santosh. Yes, I can see. Actually, in some older builds of GATK, people reported bugs producing an abnormal excess of heterozygotes, but those should not be an issue in the newer ones. I assume that if the PL values were 400, 0, 5339, then we could doubt that particular call?
PL values are normalized, with the most probable genotype getting a score of zero (the het here). The next one (hom ref) is quite far away at 400 on the Phred scale, i.e. about 10^-40 times as likely as the het, which means the genotype is almost certainly het.
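To make the PL arithmetic concrete, here is a small Python sketch for a diploid, biallelic site, where PL is ordered hom-ref, het, hom-alt. It assumes GQ is the gap between the lowest and second-lowest PL, capped at 99, which matches how GATK reports it as far as I know. The function name is made up for illustration.

```python
def interpret_pl(pl):
    """Summarize a diploid biallelic PL triple ordered (hom-ref, het, hom-alt)."""
    labels = ["hom-ref", "het", "hom-alt"]
    # The most likely genotype is the one with the smallest (normalized) PL
    best = min(range(len(pl)), key=lambda i: pl[i])
    # GQ: gap between the best and second-best PL, capped at 99 (assumption, see above)
    second = sorted(pl)[1]
    gq = min(99, second - pl[best])
    # PL = -10 * log10(relative likelihood), so log10 of the relative likelihood is -PL/10
    log10_rel = [-p / 10 for p in pl]
    return labels[best], gq, log10_rel


print(interpret_pl([3048, 0, 5339]))  # ('het', 99, [-304.8, 0.0, -533.9])
print(interpret_pl([400, 0, 5339]))   # ('het', 99, [-40.0, 0.0, -533.9])
```

Both the observed PL and the hypothetical 400, 0, 5339 case come out as a het with GQ 99. A call only becomes doubtful when the second-smallest PL drops into the low tens.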
You mentioned three different experiments, of which only one shows most of the SNPs as hets. Were all three experiments done using the same chemistry and sequencing machine? Do you have any information about these particular samples? Are you sure these three rice samples are from the same subspecies and should be used with the same reference? Which reference are you using?
Hi Petr. Yes, these were done using the same chemistry and sequencing machine. Apparently they are from the same subspecies, but there is previous evidence that the particular one I am talking about falls in a different subclade. Can that be a reason? And yes, using the same reference is not logical. I am using the Nipponbare reference genome because it is well annotated.
Yes, that may be the reason if the two other samples are Nipponbare and the heterozygous one is not. Do you have approximately the same number of SNPs called in all three samples, and are the coverages similar?
Thank you Petr. The other two samples are not Nipponbare-related either; they are also from a different sub-subclade than Nipponbare. The SNP calls and coverage do not differ hugely, other than in the number of hets and homs.
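For putting numbers on the het/hom comparison between samples, a rough Python sketch that tallies genotype classes per sample from VCF text lines could look like this. It assumes a standard VCF layout (tab-separated, sample columns starting at column 10, a GT key in FORMAT); the function name and the sample names S1 and S2 are made up for illustration.

```python
from collections import Counter


def count_genotypes(vcf_lines):
    """Count het / hom_ref / hom_alt / missing calls per sample in VCF text lines."""
    samples, counts = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            counts = {s: Counter() for s in samples}
            continue
        gt_index = fields[8].split(":").index("GT")
        for sample, column in zip(samples, fields[9:]):
            # Treat phased (|) and unphased (/) genotypes the same way
            alleles = column.split(":")[gt_index].replace("|", "/").split("/")
            if "." in alleles:
                kind = "missing"
            elif len(set(alleles)) > 1:
                kind = "het"
            elif alleles[0] == "0":
                kind = "hom_ref"
            else:
                kind = "hom_alt"
            counts[sample][kind] += 1
    return counts


# Tiny made-up example with two samples
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:GQ\t0/1:99\t1/1:99",
    "chr1\t200\t.\tC\tT\t50\t.\t.\tGT:GQ\t0|1:99\t./.:0",
]
for sample, tally in count_genotypes(example).items():
    print(sample, dict(tally))
```

For a real file, pass an open file handle instead of the list, e.g. `count_genotypes(open("calls.vcf"))`, where `calls.vcf` is a placeholder path.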
Hm... Could you please tell us more about these subclades? Do you think the het one is closer to species found in the wild, while the hom samples are from a clade that has undergone more human selection and cultivation? In other words, can you find any logical explanation for the het/hom difference between these three?
Yes, that can be a reason. The other two have undergone more cultivation, but the het one was newly collected from farmers. All of them are indica subspecies, but in phylogenetic analysis the het one falls away from the other two. Still, the high percentage of heterozygosity seems so unreal.
For old, stable varieties, a very high hom percentage is expected, while for a variety from the wild or from a very recent cross between different subspecies, a higher het percentage is expected. Do you get consistent results with an indica reference?
Also, could you tell us more about sample collection, sample preparation, library prep and so on in all three cases?
Seeds were grown under greenhouse conditions, and DNA was then collected from rice plants at the seedling stage. Multiple plants were pooled to get the DNA. Library preparation and sequencing were done commercially. Illumina chemistry was used. Libraries with four different insert sizes were prepared. The protocol was the same for all samples.
I am actually aligning it to an indica reference now, and it has not finished yet. Will update.
Could this be contamination? Perhaps try to validate a few variants (using Sanger sequencing) from a fresh sample.
Thanks for your reply. I was quite sure it is not contamination, as the samples were prepared carefully. I also checked some allele-specific markers on a gel, and they showed two bands as well. Thanks.