Hi,
I have imputed a large dataset with mach / minimac, which resulted in 9 TB of data. The next step of my analysis is to compute the r squared hat metric in order to assess the quality of the imputation. (I know that minimac reports its own quality metric, but I am interested in r squared hat in particular.)
The QuickTest tool can extract this metric with the option --compute-rSqHat, which according to the documentation does the following:
Compute r−squared-hat for each SNP, which is the (estimated) fraction of variance in unobserved 0/1/2 genotype explained by the individual mean genotypes. (We assume this is the same definition used by Abecasis et al.)
The problem is that in order to run QuickTest I would have to convert 9 TB of data to the QuickTest format. To save this hassle I would prefer to write a script that calculates the metric. So, does anyone know how to compute the "r squared hat" from dosage data (or from the a posteriori imputation probabilities)? All I am asking for is a formula that takes as input the imputation dosage data for a single SNP and estimates the "r squared hat" metric.
Thanks a lot!
See "Imputation and association testing" about the INFO calculation: http://hmg.oxfordjournals.org/content/17/R2/R122.full . It seems the mach quality metric is RSQR_HAT; is that not the same metric? Also, running plink on dosage files outputs INFO: http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml
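For a single SNP, a commonly cited form of this kind of metric is the ratio of the observed variance of the dosages to the variance expected under Hardy-Weinberg equilibrium, 2p(1-p), where p is the allele frequency estimated as mean dosage / 2. Below is a minimal Python sketch of that ratio. Whether QuickTest or mach use exactly this estimator is an assumption, as are the function name rsq_hat and the choice to divide by n rather than n - 1.

```python
def rsq_hat(dosages):
    """Observed dosage variance divided by the variance expected under HWE.

    `dosages` holds one SNP's imputed dosages (values in [0, 2]), one per sample.
    Assumption: this ratio is the intended "r squared hat"; check it against
    your tool's output on a small subset before relying on it.
    """
    n = len(dosages)
    if n == 0:
        return float("nan")
    mean = sum(dosages) / n
    p = mean / 2.0                      # estimated allele frequency
    expected_var = 2.0 * p * (1.0 - p)  # genotype variance under HWE
    if expected_var == 0.0:
        return float("nan")             # monomorphic SNP: ratio is undefined
    observed_var = sum((d - mean) ** 2 for d in dosages) / n
    return observed_var / expected_var
```

With hard-call-like dosages such as [0, 1, 1, 2] the ratio is 1, and it falls towards 0 as the dosages shrink towards the mean, which is what happens when the imputation is uncertain.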
The reason I cannot use the RSQRHAT from mach is that I have performed sample chunking. So I do have the RSQRHAT value for each of my sample chunks (around 30) but I want to be able to compute it in total for all samples per SNP.
I am not sure what the common practice is for this situation, but I think taking the mean or median of the RSQHATs over chunks should be OK; if you want to be strict, then maybe the minimum. Again, if you run plink assoc on the dosage files, the INFO would be calculated overall.
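If the per-sample dosages of each chunk are still available, another option is to pool them exactly instead of averaging the per-chunk values: the count, the sum and the sum of squares of the dosages add up across chunks, and the overall variance ratio follows from those three numbers. The sketch below assumes the variance-ratio definition (observed dosage variance over 2p(1-p)); the class name RsqAccumulator and its methods are hypothetical, not part of mach, minimac, QuickTest or plink.

```python
import math


class RsqAccumulator:
    """Pools one SNP's dosages across sample chunks via running sums."""

    def __init__(self):
        self.n = 0           # number of samples seen so far
        self.total = 0.0     # sum of dosages
        self.total_sq = 0.0  # sum of squared dosages

    def add_chunk(self, dosages):
        # Each chunk contributes only to the three running sums,
        # so chunks can be read one at a time from disk.
        for d in dosages:
            self.n += 1
            self.total += d
            self.total_sq += d * d

    def rsq_hat(self):
        if self.n == 0:
            return math.nan
        mean = self.total / self.n
        p = mean / 2.0                      # pooled allele frequency
        expected_var = 2.0 * p * (1.0 - p)  # genotype variance under HWE
        if expected_var <= 0.0:
            return math.nan                 # monomorphic over all samples
        observed_var = self.total_sq / self.n - mean * mean
        return observed_var / expected_var


acc = RsqAccumulator()
acc.add_chunk([0.0, 1.0])  # dosages from sample chunk 1
acc.add_chunk([1.0, 2.0])  # dosages from sample chunk 2
pooled = acc.rsq_hat()
```

The pooled value generally differs from the mean of the per-chunk values, because the allele frequency and the variance are re-estimated over all samples rather than within each chunk.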