Question

Post-imputation QC GWAS analysis

4.2 years ago

AR • 0

Hi,

I am performing GWAS analysis on human samples. My workflow:

I performed pre-imputation QC using plink and imputation using the TopMed Michigan server.
After imputation, I got 22 VCF files + 22 info files. I ran a post-imputation QC check on the imputed data (VCF).
After QC I had a separate VCF file for every chromosome (22 files).
I merged them using the bcftools merge option.

Right now I have a merged VCF for all 22 chromosomes, which is around 4.2GB. I also have the 22 info files merged together, at around 12GB. I want to check the accuracy of the imputation; basically, I want to analyse plots of MAF and Rsquare values. I will get both Rsquare and MAF from the merged info file. Although I am not very good with R plots, I managed to plot MAF against the frequency of variants. But I am stuck on the histogram/scatter plot of MAF against Rsquare in RStudio. I have been trying since last week, but my file is so big that my system hangs. I am also not sure about my R script.

Can anyone please help me or refer me to a good resource for R scripts, specifically for such GWAS analysis plots? I have also tried online tutorials for R plots.

Thanks AR

GWAS Rstudio Imputation TopMed • 3.2k views

ADD COMMENT • link updated 3.8 years ago by Biostar 20 • written 4.2 years ago by AR • 0

Thanks Curious. I am trying to split up my large files and will then go for the plots in R. Thanks for your suggestion. But right now, I am stuck with the R2 values. I have analysed a single dataset containing 96 samples. After imputation, there were around 300 million variants. But after the post-imputation QC step (R2>0.5) the number dropped drastically to 10 million.

My command:

./plink --bfile s2_chr1 --qual-scores chr1.info 7 1 1 --qual-threshold 0.5 --make-bed --out plinkout_chr1

Any help would be highly appreciated.

AR

ADD REPLY • link 4.2 years ago by AR • 0

I don't know if it's fine without seeing the data; that's up to you. But overall it's pretty normal to thin out to a few dozen or so well-imputed variants.

ADD REPLY • link 4.2 years ago by curious ▴ 810

Thanks for the reply, curious. After some struggle, I was able to solve it today. The problem was with plink --qual-threshold. It should be --qual-max-threshold instead of --qual-threshold. Now I am getting 292 million variants out of 300 million after post-imputation QC. I figured it out when I plotted the Rsquare values: before, all values were below 0.5. Silly mistake, I would say.

But I then found another very suspicious thing in my Rsquare filter output. It gives me the filtered variants (Rsquare>0.5), but it also reports that around 90,000 IDs are missing. I am not able to figure out this missing ID problem. It appears only with plink --qual-threshold, not with any other plink module I have used during my post-imputation QC.

--qual-scores: 22796091 variants remaining, 90816 IDs missing.

Thanks. AR
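
A sketch for narrowing down the "IDs missing" message, written in Python rather than R or plink. The message suggests that some variant IDs do not match between the info file and the plink dataset, so one way to investigate is to list the IDs that appear in only one of the two files. The function names are hypothetical, and the sketch assumes an uncompressed info file with one header line and the variant ID in its first column (matching the `7 1 1` arguments in the command above).

```python
def read_ids(path, column, skip_header=False):
    """Collect the values found in one whitespace-delimited column (0-based)."""
    ids = set()
    with open(path) as handle:
        if skip_header:
            next(handle, None)  # drop the header line if there is one
        for line in handle:
            fields = line.split()
            if len(fields) > column:
                ids.add(fields[column])
    return ids


def compare_ids(bim_path, info_path):
    """Return (IDs only in the .bim file, IDs only in the info file)."""
    # The second column of a .bim file holds the variant ID.
    bim_ids = read_ids(bim_path, column=1)
    # Assumption: the info file has a header and the ID in its first column.
    info_ids = read_ids(info_path, column=0, skip_header=True)
    return bim_ids - info_ids, info_ids - bim_ids
```

For example, `only_bim, only_info = compare_ids("s2_chr1.bim", "chr1.info")` would give the two sets for chromosome 1. Comparing their sizes with the count plink reports may show on which side the mismatch sits. Holding every ID in memory is only practical one chromosome at a time.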

ADD REPLY • link 4.2 years ago by AR • 0

Might want to look at the qqman package

ADD REPLY • link 3.8 years ago by Sam ★ 4.8k

score 2 · Answer 1 · 2020-08-30

I don't think there are specific resources; you just need to keep chipping away.

The common thing is to do line plots or box plots that summarize average Rsq over minor allele frequency bins, kind of like this:

https://imgur.com/a/IIIwZ4n

People do scatterplots sometimes too, but that is for "smaller" imputations; for TOPMed that's going to end up being something like 300M points, which is kind of a lot to draw.

Try to cut down the info file so it is just one column for MAF and one for Rsq? Maybe try to split it further into a file for each MAF bin and then use that? Basically, any way you can split it up.
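
The idea of reducing the info file to MAF and Rsq and summarising Rsq per MAF bin could be sketched roughly as below. This is Python rather than R, used here because it can stream the file row by row instead of loading 12GB into memory. The column names `MAF` and `Rsq`, the tab delimiter, the bin edges, and the function names are all assumptions about the info file layout, so adjust them to match the actual header.

```python
import csv
import gzip
from bisect import bisect_right
from collections import defaultdict

# Assumed MAF bin edges; finer bins at the rare end, where Rsq changes fastest.
MAF_BINS = [0.0, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5]


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def bin_index(maf, edges=MAF_BINS):
    """Index of the bin [edges[i], edges[i+1]) holding maf, clamped to valid bins."""
    i = bisect_right(edges, maf) - 1
    return min(max(i, 0), len(edges) - 2)


def mean_rsq_by_maf_bin(info_path, maf_col="MAF", rsq_col="Rsq", edges=MAF_BINS):
    """Stream the info file once; return {(low, high): (n_variants, mean_rsq)}."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    with open_text(info_path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                maf = float(row[maf_col])
                rsq = float(row[rsq_col])
            except (TypeError, ValueError):
                continue  # skip rows with missing or non-numeric values
            i = bin_index(maf, edges)
            counts[i] += 1
            sums[i] += rsq
    return {
        (edges[i], edges[i + 1]): (counts[i], sums[i] / counts[i])
        for i in sorted(counts)
    }
```

Calling `mean_rsq_by_maf_bin("chr1.info")` on one chromosome at a time returns a table with only a handful of rows, which is small enough to plot as a line or bar chart in R or any other tool. It gives bin means only; a box plot would need the per-bin Rsq values (or quantiles) kept as well.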