Question

PRSice on Imputation vcf.gz data; conversion of files

3.7 years ago

kstafford32 • 0

Hello,

I conducted imputation using TOPMED and received vcf.gz output files, one for each of the 22 autosomes. I would now like to run a PRS analysis using PRSice-2.

How can I properly convert my vcf.gz files to a concatenated plink binary file or a concatenated .bgen file and retain the SNP ID, p-value, and allele columns?

According to the PRSice tutorial as well as other forums, PRSice does not accept vcf.gz files, only plink bed files or .bgen files.

Thus, I attempted to make a concatenated .bgen file using this:

vcf-concat *.vcf.gz | gzip -c > imputedtopmedresults.concat.ALLchrs.vcf.gz
ml qctool
qctool -g imputedtopmedresults.concat.ALLchrs.vcf.gz -vcf-genotype-field GP -og imputedtopmedresults.concat.ALLchrs.converted.bgen

I then fed this imputedtopmedresults.concat.ALLchrs.converted.bgen file in as my base data for the PRSice code:

Rscript PRSice.R \
    --prsice ./PRSice_linux \
    --base imputedtopmedresults.concat.ALLchrs.converted.bgen \
    --target MDD.QC.gz \
    --thread 1 \
    --stat BETA \
    --beta \
    --binary-target F

This error was returned:

Error: Column for the effective allele must be provided!
Error: Column for the SNP ID must be provided!
Error: Column for the P-value must be provided!

During the conversion from vcf.gz to .bgen, it was clear that my SNP IDs, p-values, and alleles were not retained. I then tried converting my vcf.gz files to plink binary files using another method:

for i in {1..22}; do
bcftools norm -Ob -m-any chr$i.dose.vcf.gz > chr$i.dose.bcf
done

for i in {1..22}; do
bcftools index chr$i.dose.bcf
done

ml plink
for i in {1..22}; do
plink --bcf chr$i.dose.bcf --const-fid 0 --make-bed --out chr${i}_ped; done

I fed the plink binary file into PRSice and the same error occurred.

I went back to check the vcf.gz file and these headers are there:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT

How can I properly convert my vcf.gz files to a concatenated plink binary file or a concatenated .bgen file and retain the SNP ID, p-value, and allele columns?

Or perhaps TOPMED doesn't provide pvalues, etc., and I am missing something here...?

Thank you

prisce imputation vcf plink bgen • 1.9k views

Updated 3.7 years ago by Sam ★ 4.8k • written 3.7 years ago by kstafford32 • 0

score 1 · Answer 1 · 2021-03-08

Hi,

The main problem is that you have mixed up the base and target files. The base file is the summary statistics file, which contains the effect size estimates for the trait of interest (based on your script, I would guess that it is the MDD.QC.gz file). The target is the genotype data you wish to calculate the PRS on, here the imputed files. Imputation output contains only genotypes, not p-values or effect sizes, which is why PRSice reports missing columns when it is given as the base. As for the input, you don't actually need to concatenate the bgen files (assuming these are the target files): all you need is --target chr# --type bgen, and PRSice will automatically look for chr1.bgen ... chr22.bgen.
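
For the per-chromosome layout, a minimal sketch of the conversion might look like the following. It reuses the qctool options from the question and assumes the imputed files are named chr<N>.dose.vcf.gz (as in the bcftools loop above) and that the GP field is present in them; the helper name convert_cmd is made up for illustration. The script only prints the commands, so they can be inspected before anything is run.

```shell
#!/bin/sh
# Dry-run sketch: print one qctool command per autosome so that the outputs
# are named chr1.bgen ... chr22.bgen, the layout that --target chr# expects.
# Assumption: inputs are named chr<N>.dose.vcf.gz and contain a GP field.

convert_cmd() {
    chr="$1"
    echo "qctool -g chr${chr}.dose.vcf.gz -vcf-genotype-field GP -og chr${chr}.bgen"
}

i=1
while [ "$i" -le 22 ]; do
    convert_cmd "$i"    # only prints the command; nothing is converted here
    i=$((i + 1))
done
```

Piping the printed lines into sh would run the conversions; no concatenation step is needed.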

Also, when you use bgen as input with version 2.3.3 (the latest version), make sure you use the --allow-inter flag, as there seems to be a bug that prevents PRSice from running without it.
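
Putting these points together, the call from the question with base and target swapped might look roughly like this. It is a sketch, not a tested command: it assumes MDD.QC.gz really is the summary statistics file and that the target files are named chr1.bgen ... chr22.bgen; the helper name prsice_cmd is made up for illustration. The script only prints the command.

```shell
#!/bin/sh
# Dry-run sketch: build the PRSice call with base and target swapped.
# Assumptions: MDD.QC.gz is the GWAS summary statistics file (base) and the
# imputed genotypes are in chr1.bgen ... chr22.bgen (target).

prsice_cmd() {
    base="$1"      # summary statistics: SNP ID, effect allele, BETA, p-value
    target="$2"    # bgen prefix; PRSice substitutes '#' with 1..22
    printf '%s\n' "Rscript PRSice.R --prsice ./PRSice_linux --base ${base} --target ${target} --type bgen --allow-inter --stat BETA --beta --binary-target F --thread 1"
}

# Print the command for inspection; nothing is executed here.
prsice_cmd MDD.QC.gz 'chr#'
```

If the column names in the summary statistics differ from the PRSice defaults, they can be named explicitly with --snp, --a1 and --pvalue. A bgen target also needs sample and phenotype information; check the PRSice documentation for how to supply it.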

Sam