Hello everyone
This is my first time working on imputation of GWAS data. I have chromosome-specific VCF files. One of the chromosome files has 195,276 SNPs for 293 individuals. These are the steps I followed:
1) I uploaded the VCF to the Michigan Imputation Server and selected a reference panel. All steps (Input Validation, Quality Control, Pre-phasing and Imputation) completed without any error. The report stated:
Excluded sites in total: 1,532
Remaining sites in total: 339,099
As output I received chr.dose.vcf.gz
2) Next, I used PLINK to get the "bed" and "bim" file formats.
./plink --vcf chr.dose.vcf.gz --make-bed --double-id --biallelic-only --out chr_biallelic
The PLINK log file reported "4057885 variants and 293 people pass filters and QC".
That is a big change in the number of SNPs, from 339,099 to 4,057,885.
3) Is it OK if I extract the 339,099 SNPs from chr.dose.vcf.gz in order to keep only the desired SNP sites?
I would appreciate any suggestions.
Thanks in advance A
Maybe I'm not following right, but it looks like you uploaded 339,099 directly typed sites to the imputation server. The server then imputed out to roughly 4 million sites, which is why PLINK now reports 4,057,885 variants.
Why would you want to extract your 339,099 SNPs after imputation? What was the point of imputing, then?
I'm scratching my head a bit here, could you better describe what you are trying to achieve?
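If it helps to confirm where the extra variants come from, here is a minimal Python sketch that tallies typed versus imputed records in the dose VCF. It assumes the INFO column carries the flags TYPED, IMPUTED and TYPED_ONLY, which minimac-based output commonly does; check the header of your own file, because that is an assumption and not something shown in this thread. The function names are hypothetical.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF for reading as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_site_types(path):
    """Count records by the typed/imputed flag found in the INFO column.

    Assumption: INFO contains one of the flags TYPED_ONLY, TYPED or IMPUTED.
    Records with none of these flags are counted as "unlabelled".
    """
    counts = Counter()
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            columns = line.rstrip("\n").split("\t", 8)
            info = columns[7]  # INFO is the 8th VCF column
            keys = {field.split("=", 1)[0] for field in info.split(";")}
            if "TYPED_ONLY" in keys:
                counts["typed_only"] += 1
            elif "TYPED" in keys:
                counts["typed"] += 1
            elif "IMPUTED" in keys:
                counts["imputed"] += 1
            else:
                counts["unlabelled"] += 1
    return counts
```

Calling `count_site_types("chr.dose.vcf.gz")` returns a counter keyed by `typed`, `typed_only`, `imputed` and `unlabelled`. If the flags are present, the typed counts should be close to the number of sites that survived the server's QC, and the imputed count should account for the rest.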