I have a large dataset of whole genome sequencing (WGS) data. A recent large GWAS reported a number of promising significant hits, and I would like to check whether these SNPs are also associated with the specific phenotypic trait I'm interested in within my own WGS data. The sequencing was done by Illumina, and I have the .bam files and the .vcf files that they provided. What is the general workflow for this type of analysis? Because I have the LD blocks for these SNPs, my thought was to first extract those regions from the WGS data using SAMtools (or R). Do I then need to convert the extracted regions into VCF files and run an association analysis against my phenotype of interest? Thanks in advance for your help.
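To make the plan above concrete, here is a minimal Python sketch of what I have in mind: keep only the VCF records that fall inside the LD blocks, then compare allele counts between cases and controls at each retained site. Everything in it is an assumption for illustration: the helper names (`in_regions`, `allele_count_tables`, `chi_square_2x2`), the demo records, the region coordinates, and the case/control labels are all made up, and it assumes biallelic sites with `GT` as the first FORMAT field. A real analysis would more likely rely on dedicated tools such as bcftools or PLINK and a regression model with covariates, so this is only meant to show the shape of the workflow.

```python
from typing import Dict, Iterable, List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), 1-based inclusive


def in_regions(chrom: str, pos: int, regions: Iterable[Region]) -> bool:
    """True if chrom:pos lies inside any of the given regions."""
    return any(c == chrom and s <= pos <= e for c, s, e in regions)


def allele_count_tables(
    vcf_lines: Iterable[str],
    regions: List[Region],
    is_case: Dict[str, bool],
) -> Dict[Tuple[str, int], List[List[int]]]:
    """Return {(chrom, pos): [[case_ref, case_alt], [ctrl_ref, ctrl_alt]]}
    for records inside the regions. Assumes biallelic sites."""
    samples: List[str] = []
    tables: Dict[Tuple[str, int], List[List[int]]] = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos = fields[0], int(fields[1])
        if not in_regions(chrom, pos, regions):
            continue
        table = [[0, 0], [0, 0]]
        for sample, call in zip(samples, fields[9:]):
            if sample not in is_case:
                continue  # no phenotype recorded for this sample
            row = 0 if is_case[sample] else 1
            genotype = call.split(":")[0].replace("|", "/")
            for allele in genotype.split("/"):
                if allele == "0":
                    table[row][0] += 1
                elif allele == "1":
                    table[row][1] += 1
                # missing calls (".") are left out of the counts
        tables[(chrom, pos)] = table
    return tables


def chi_square_2x2(table: List[List[int]]) -> float:
    """Pearson chi-square statistic (1 d.f., no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Made-up demo data: two records, only the first lies inside the region.
demo_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4",
    "chr1\t150\trs_demo\tA\tG\t.\tPASS\t.\tGT\t1/1\t0/1\t0/0\t0/0",
    "chr1\t9000\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t0/0\t0/1\t0/0",
]
demo_regions = [("chr1", 100, 200)]  # hypothetical LD block
demo_pheno = {"S1": True, "S2": True, "S3": False, "S4": False}

demo_tables = allele_count_tables(demo_vcf, demo_regions, demo_pheno)
for site, table in demo_tables.items():
    print(site, table, chi_square_2x2(table))
```

With only four samples the statistic is not meaningful; the point is just that the provided VCF records can be subset by position and tested site by site.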
Hi Katie,
Thanks for your reply! Could you explain more about what you mean by "Doing it the way you described above has advantages (shorter run time, less storage needed, smaller corrections for multiple tests), but what a shame it would be to ignore so much data!" My thought was that, yes, it would be much faster, and I would also be removing any unrelated information. Is this a naive approach? What would you suggest as a better method?
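For reference, this is how I understand the "smaller corrections for multiple tests" part, as a small Python sketch using a plain Bonferroni correction. The function name `bonferroni_threshold` is my own, and the counts are assumptions chosen for illustration (20 candidate SNPs versus one million tests genome-wide), not numbers from my data.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical counts, for illustration only.
candidate_threshold = bonferroni_threshold(0.05, 20)
genome_wide_threshold = bonferroni_threshold(0.05, 1_000_000)

print(f"candidate SNPs only: p < {candidate_threshold:.2g}")
print(f"genome-wide scan:    p < {genome_wide_threshold:.2g}")
```

Under these assumed counts, testing only the candidate SNPs leaves a far less stringent per-test threshold than a genome-wide scan, which is the trade-off I am asking about.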