Edit: See answer. tl;dr: code should have been:
plink2 --vcf [input-name] dosage=HDS --exclude-if-info "R2<=0.3" --export vcf --out [output-name]
Original post
I'm struggling with post-imputation processing of some data, and I would be very grateful for some guidance.
I have a data set that was imputed through the Michigan Imputation Server, and I now need to perform post-imputation processing. I've attempted to run the data through Plink with the following command:
plink2 --vcf [vcffilename] dosage=DS --exclude-if-info "R2<=3" --score [scorefilename]
E.g.:
plink2 --vcf file1.DOSAGE.vcf dosage=DS --exclude-if-info "R2<=3" --score file1.INFO
I want to process this data per superpopulation (AFR, ALL, AMR, EUR, EAS, or SAS), but I am working with fewer than 50 samples per superpopulation. As a result, when I try to run the data through Plink, it reports that I need frequency files from larger, similar populations. I figured that the .freq files for the 1000G superpopulations would work for this, but I cannot for the life of me find any such files. I tried to create my own, but the files located at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ just have "." in the ID column, which seems to be a problem for Plink.
Questions:
- Do .freq files per superpopulation already exist for the 1000G data? If not, is there a simple method for creating them?
- Is this even the correct way to perform post-imputation QC? I'm very new to working with this kind of data, so I'm honestly just flying blind and hodgepodging steps together from tutorials and methods I've found around the web.
Thanks in advance for any and all help. It is greatly appreciated.
What are you planning to do with the data when you finish the QC? I think plink calculates R^2 from the data itself, which would be why it needs more than 50 samples. Does your data have an imputation INFO score? You can filter on that instead, since it's already calculated by the imputation server (I think).
Thank you very much for the reply!
I know that ultimately my PI's intent is to run analyses to search for correlations between variants and participant diagnoses, but I otherwise don't know what we're doing with the data post-QC.
I received two files per chromosome from the imputation server, e.g., chr1.dose.vcf.gz and chr1.info.gz. An example of the data contained in those files:
Dose file:
Info file:
I started with attempting to filter by rsq since it was the only post-imputation QC recommendation I could find in the Michigan Imputation Server documentation. Do you know how I might use the INFO score to filter instead?
Thank you again!
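One way to act on the suggestion above, offered as a rough sketch rather than a tested pipeline: the .info.gz file from the server should already carry a per-variant imputation quality value, so low-quality variants can be listed without recomputing anything from the samples. The snippet assumes a tab-delimited file whose header row contains columns named SNP and Rsq (that layout is an assumption here, so check your own header first). `low_rsq_ids` is a hypothetical helper name, and the 0.3 cutoff simply mirrors the threshold in the tl;dr.

```shell
# Hypothetical helper: read an info table on stdin and print the IDs of
# variants whose Rsq is at or below a threshold (default 0.3).
# Assumes tab-delimited input with header columns named "SNP" and "Rsq".
low_rsq_ids() {
  awk -F'\t' -v thr="${1:-0.3}" '
    NR == 1 {
      # Locate the ID and Rsq columns by name instead of by fixed position.
      for (i = 1; i <= NF; i++) {
        if ($i == "SNP") id_col = i
        if ($i == "Rsq") rsq_col = i
      }
      if (!id_col || !rsq_col) {
        print "SNP or Rsq column not found in header" > "/dev/stderr"
        exit 1
      }
      next
    }
    # Skip rows with a missing value ("-"), then compare numerically.
    $rsq_col != "-" && ($rsq_col + 0) <= thr { print $id_col }
  '
}

# Example usage (file names are placeholders):
#   gzip -dc chr1.info.gz | low_rsq_ids 0.3 > low_rsq.txt
```

The resulting list could then be handed to plink2 with `--exclude low_rsq.txt`, provided the variant IDs in the VCF match those in the info file.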