Question

How Can I Compute The R Square Hat Statistic For Imputed Data?

1

12.2 years ago

Kantale ▴ 140

Hi,

I have imputed a large dataset with mach/minimac, which resulted in 9 TB of data. The next step of my analysis is to compute the r square hat metric in order to assess the quality of the imputation. (I know that minimac provides its own quality metric, but I am specifically interested in r square hat.)

The QuickTest tool can extract this metric with the option --compute-rSqHat, which according to the documentation does the following:

Compute r-squared-hat for each SNP, which is the (estimated) fraction of variance in unobserved 0/1/2 genotype explained by the individual mean genotypes. (We assume this is the same definition used by Abecasis et al.)

The problem is that in order to run QuickTest I would have to convert 9 TB of data to the QuickTest format. To save this hassle I would prefer to write a script to calculate this metric. So, does anyone know how to compute "r squared hat" from dosage data (or from the imputation a posteriori probabilities)? All I am asking for is a formula that takes as input the imputation dosage data for a single SNP and estimates the "r square hat" metric.

Thanks a lot!

imputation statistics • 7.1k views

ADD COMMENT • link 12.2 years ago by Kantale ▴ 140

1

See "Imputation and association testing" for the INFO calculation: http://hmg.oxfordjournals.org/content/17/R2/R122.full It seems the mach quality metric is RSQR_HAT; is that not the same metric? Also, running plink on dosage files outputs INFO: http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml

ADD REPLY • link 12.2 years ago by zx8754 12k

0

The reason I cannot use the RSQRHAT from mach is that I have performed sample chunking. So I do have the RSQRHAT value for each of my sample chunks (around 30), but I want to be able to compute it across all samples, per SNP.

ADD REPLY • link 12.2 years ago by Kantale ▴ 140

0

I am not sure what the common practice is in this situation, but I think taking the mean or median of the RSQHATs over the chunks should be OK; if you want to be strict, then maybe the minimum. Again, if you run plink assoc on the dosage data, INFO would be calculated over all samples.

ADD REPLY • link 12.2 years ago by zx8754 12k
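For the chunked case discussed in this thread, one alternative to averaging per-chunk values is to pool the dosage sums across chunks and compute a single ratio per SNP. The sketch below assumes the commonly cited dosage-only form of the metric, namely the observed variance of the dosages divided by the variance expected under HWE, 2p(1-p), with p estimated as half the mean dosage. The class name `DosageMoments` and its methods are hypothetical, and the variance uses an n (not n-1) denominator, which may differ from what a particular tool does.

```python
from dataclasses import dataclass


@dataclass
class DosageMoments:
    """Running sums of the dosages of one SNP (hypothetical helper)."""
    n: int = 0              # number of samples seen so far
    total: float = 0.0      # sum of dosages
    total_sq: float = 0.0   # sum of squared dosages

    def update(self, dosages):
        """Add the dosages (values in [0, 2]) of one sample chunk."""
        for d in dosages:
            self.n += 1
            self.total += d
            self.total_sq += d * d

    def rsq_hat(self):
        """Observed dosage variance over the HWE-expected variance 2p(1-p)."""
        if self.n == 0:
            return float("nan")
        mean = self.total / self.n
        p = mean / 2.0                      # estimated allele frequency
        expected = 2.0 * p * (1.0 - p)      # assumption: HWE-based denominator
        if expected <= 0.0:
            return float("nan")             # monomorphic SNP, ratio undefined
        observed = self.total_sq / self.n - mean * mean
        return observed / expected


# Usage: feed the same SNP's dosages from each sample chunk in turn.
acc = DosageMoments()
for chunk in ([0.0, 2.0], [1.0, 1.0]):
    acc.update(chunk)
pooled = acc.rsq_hat()
```

Only three numbers per SNP need to be kept between chunks, so the full dataset never has to be loaded or converted at once.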

score 1 · Answer 1 · 2013-02-14

1

12.2 years ago

Kantale ▴ 140

I was able to figure out what is happening (with the help of the QuickTest author, who kindly responded to me). On page 32 of this document there is a detailed presentation of the r square hat metric. PLINK's INFO metric uses the G2 definition, whereas QuickTest with the r-squared-hat option uses G3. In principle, as discussed in the document, these metrics are equivalent under HWE but can otherwise give different values. I wrote these formulas in Python for whoever is interested:
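The original code is not reproduced in this post. What follows is a minimal sketch, not the author's script, of two variance-ratio estimates that match the descriptions in this thread: one uses the HWE-expected variance 2p(1-p) as the denominator and needs only dosages; the other estimates the genotype variance from the posterior probabilities and does not assume HWE. Whether these correspond exactly to the G2 and G3 definitions in the referenced document is an assumption, and the function names are hypothetical.

```python
def rsq_hat_hwe(dosages):
    """Dosage variance divided by the HWE-expected variance 2p(1-p)."""
    n = len(dosages)
    mean = sum(dosages) / n
    var_dosage = sum((d - mean) ** 2 for d in dosages) / n
    p = mean / 2.0                       # estimated allele frequency
    expected = 2.0 * p * (1.0 - p)
    return var_dosage / expected if expected > 0 else float("nan")


def rsq_hat_posterior(probs):
    """Dosage variance divided by the genotype variance estimated from
    posterior probabilities; probs is a list of (p0, p1, p2) per sample."""
    n = len(probs)
    dosages = [p1 + 2.0 * p2 for _, p1, p2 in probs]
    mean = sum(dosages) / n
    var_dosage = sum((d - mean) ** 2 for d in dosages) / n
    # E[G^2] for one sample is 0*p0 + 1*p1 + 4*p2; average it over samples.
    mean_sq = sum(p1 + 4.0 * p2 for _, p1, p2 in probs) / n
    var_geno = mean_sq - mean * mean     # total variance of the 0/1/2 genotype
    return var_dosage / var_geno if var_geno > 0 else float("nan")


# Usage: perfectly certain genotypes 0, 1, 2, 1 for four samples.
certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
r_hwe = rsq_hat_hwe([p1 + 2.0 * p2 for _, p1, p2 in certain])
r_post = rsq_hat_posterior(certain)
```

The second function follows the "fraction of variance in the unobserved genotype explained by the individual mean genotypes" wording: the numerator is the variance of the per-sample means (dosages) and the denominator is the total genotype variance.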

ADD COMMENT • link 12.2 years ago by Kantale ▴ 140

0

Dear Kantale, although it is long after your post, I would be interested to see your Python script for computing the r square hat metric, in order to assess the quality of imputation done with the Beagle software. The links you gave cannot be reached. It would be of great help if you could kindly provide them again. Thank you in advance.

ADD REPLY • link 6.2 years ago by jzannatun • 0

0

Hi, I came here by accident after a long time and saw your post. Sorry for the delay. The scripts are available here: https://github.com/kantale/scripts/blob/e6e48ee3a514a7a4fa2b40479925e0ee635e897c/imputation/imputation_quality_metrics.py

ADD REPLY • link 5.1 years ago by Kantale ▴ 140