Hi Biostars:
I have genotype data called from whole-exome sequencing, which contains 0.12 million variants (mostly SNPs). In theory, it is feasible to impute all SNPs across the whole genome using a genotype imputation model. But I am wondering how accurate it would be to use such a small set of variants (exonic variants) to impute all variants across the genome (> 10 million using the 1000G reference). I understand that R2 can be used to filter out low-quality imputed variants, but is it really OK to do imputation in this way?
Thanks! Tao
Hi Kevin, Thanks for your reply and time! It's a public dataset; we just want to use it in our project, which needs genotypes across the whole genome. For the first and second references, it seems they only imputed variants in the exome, based on a reference panel from the NHLBI Exome Sequencing Project. For the third reference, it seems they wanted to show that, by imputing from a whole-genome array (Omni2.5) with 1000G, they could recover the sites on the exome chip. So, I didn't see that they had done anything like what I described. Please do correct me if my understanding is wrong. Thanks! Tao
Hey Tao,
Yes, the idea is that these are just similar studies, i.e., not exactly the same, but not completely different either, that you could use as a starting point.
I do not doubt that you could complete an imputation in the way that you desire, but it's just the credibility of the results that I doubt. Imputation is just statistical relationships, at the end of the day, and is known to produce incorrect genotypes even when done properly.
I hope that others can contribute to the discussion.
Kevin
Hi Kevin,
Thanks! You are right. The imputation itself can be run without any problem, but how accurate would it be? That's exactly what I am concerned about! Thanks for your references.
Best, Tao
Hey Tao,
I would really doubt the accuracy, particularly as you go further into intergenic regions and away from genes. Far away from each gene, you just won't have concrete data with which to make any sort of accurate imputation. It would be akin to making random calls, i.e., by chance, you'll be able to impute some genotypes far away from genes, but these could possibly be errors. However, as you implied, I think that many of the imputed SNPs would not even make it to the final dataset, as they may fall well below an R2 of 0.3 or 0.4, or would fail by some other metric.
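To make the R2 filtering step concrete, here is a minimal sketch in Python. It assumes you have already parsed the per-variant imputation quality from your imputation tool's output; the `ImputedVariant` type, the `filter_by_r2` helper, and the variant IDs are hypothetical names for illustration, not part of any particular tool.

```python
from typing import Iterable, List, NamedTuple


class ImputedVariant(NamedTuple):
    variant_id: str  # hypothetical identifier, for illustration only
    r2: float        # imputation quality estimate reported by the imputation tool


def filter_by_r2(variants: Iterable[ImputedVariant], min_r2: float = 0.3) -> List[ImputedVariant]:
    """Keep only variants whose imputation R2 is at least min_r2."""
    return [v for v in variants if v.r2 >= min_r2]


example = [
    ImputedVariant("rs_hypothetical_1", 0.95),
    ImputedVariant("rs_hypothetical_2", 0.12),
    ImputedVariant("rs_hypothetical_3", 0.30),
]
kept = filter_by_r2(example, min_r2=0.3)
print([v.variant_id for v in kept])
```

Note that a filter like this only removes variants the model itself flags as uncertain; it does not guarantee that the variants that pass are correct.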
What is the aim of your experiment, generally? If you are just interested in imputing genotypes in enhancer and promoter regions, the TSS, or the 5'/3' UTRs, then you could just impute a certain distance from each gene. Why not just impute up to 25,000 bp from each gene start and terminal exon? That is probably still too great a distance, but it's worth a try. It will not encompass all enhancer regions either, as these can be >100,000 bp from gene bodies and still regulate transcription.
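As a rough sketch of how the "impute only near genes" idea could be set up, the Python below pads each gene by a flank (25,000 bp by default) and merges overlapping windows into a region list. The `flank_and_merge` helper and the gene coordinates are made up for illustration; in practice the coordinates would come from a gene annotation, and the windows should also be clipped to chromosome lengths, which this sketch does not do.

```python
from typing import List, Tuple

# (chromosome, start, end), 1-based inclusive coordinates
Interval = Tuple[str, int, int]


def flank_and_merge(genes: List[Interval], flank: int = 25_000) -> List[Interval]:
    """Pad each gene by `flank` bp on both sides and merge overlapping windows."""
    # Pad, clamp the start at position 1, and sort by chromosome then start.
    padded = sorted((chrom, max(1, start - flank), end + flank) for chrom, start, end in genes)
    merged: List[Interval] = []
    for chrom, start, end in padded:
        # Merge with the previous window if it overlaps or directly abuts it.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2] + 1:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical gene coordinates, for illustration only.
genes = [("chr1", 100_000, 120_000), ("chr1", 150_000, 160_000), ("chr2", 10_000, 20_000)]
regions = flank_and_merge(genes)
for chrom, start, end in regions:
    print(f"{chrom}:{start}-{end}")
```

The resulting regions could then be used to restrict which imputed positions you keep, rather than retaining imputed calls deep in intergenic regions.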
Another interesting study on this topic is here. In it, the authors specifically state that imputation accuracy suffers as the distance between a genotyped SNP and an imputed SNP increases.
I would also encourage you to seek the opinions of others in your department, just to corroborate what I am saying.
Kind regards
Kevin
Edit: to give you an idea, high-density genotyping microarrays will genotype genome-wide with a mean distance between genotyped positions of ~3,500 bp. Even when imputing from that level of density, errors in the imputation occur.
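For a back-of-the-envelope comparison of marker density, here is a small Python sketch. It assumes a human genome length of roughly 3.1 billion bp (my assumption, not a figure from this thread) and spreads markers evenly, which is unrealistic for exome data: exome variants cluster in exons, so the real gaps in intergenic regions are far larger than the even-spacing number suggests.

```python
# Assumption: approximate human genome length in bp (not stated in the thread).
GENOME_LENGTH_BP = 3_100_000_000


def mean_spacing(n_markers: int, genome_length_bp: int = GENOME_LENGTH_BP) -> float:
    """Mean distance between markers if they were spread evenly across the genome."""
    return genome_length_bp / n_markers


def markers_for_spacing(spacing_bp: float, genome_length_bp: int = GENOME_LENGTH_BP) -> int:
    """Approximate number of evenly spread markers implied by a mean spacing."""
    return round(genome_length_bp / spacing_bp)


array_markers = markers_for_spacing(3_500)  # marker count implied by ~3,500 bp spacing
exome_spacing = mean_spacing(120_000)       # even-spacing equivalent of 0.12 million variants

print(f"Array with ~3,500 bp spacing: about {array_markers:,} markers")
print(f"120,000 variants spread evenly: about {exome_spacing:,.0f} bp apart")
```

Even under the generous even-spacing assumption, the exome variant set is several times sparser than a high-density array, and in reality almost all of it sits in or near exons.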
Thanks so much for your suggestions and reference! That's very helpful! This dataset is one of several datasets I use in my project, which needs genotypes across the whole genome, not only in regions near genes. Luckily, we just found that genotype data called from WGS is now available for that dataset. So, that's not a big problem for me now. But I benefited a lot from the discussion with you! And I think it will also benefit others in a similar situation. Best, Tao
Okay, great, best of luck with the remainder of your project.