Entering edit mode
2.8 years ago
Salahuddin
•
0
Hi,
I have 22 chromosome .bgen files separately. For PRS, i want to merge them using -pmerge. For that I need bcf formats. So basically I'm at 1st stage, converting .bgen to bcf and quality control. But after giving command, I'm getting error message -
Error: Only VCF, oxford, bgen-1.x, haps, hapslegend, A, AD, A-transpose, and ind-major-bed output have been implemented so far.
My command -
plink2 \
--bgen /mnt/d/ukb_genetic_data/ukb_imp_chr1_v3.bgen ref-first \
--sample /mnt/d/ukb_genetic_data/ukb30172_imp_chr22_v3_s487324.sample \
--maf 0.01 \
--hwe 1e-6 \
--geno 0.01 \
--mind 0.01 \
--export bcf \
--out chr101
Thanks for your help.
What exactly do you want to do? You want to merge the bgen file? Or do you want to convert bgen files into VCF formats? Most PRS software I know of does not support bcf / vcf formats.
For PRS analysis, I want to merge the bgen files.
Hey Sam, Maybe it will help more, If I write in details. This is my first time using PRSice. I want to do PRS on UKB dataset of a specific trait. I have 22 .bgen, 22 .bgi and 1 sample file (1 for each chromosome = 22) in hand. To make the target file ready I understood, I need to quality control these 22 .bgen file and merge them into one file. To merge the files, I found (https://groups.google.com/g/plink2-users/c/hkgxoyjBbWg/m/4YTauyQrAgAJ) i need to do -pmerge. Correct me if I'm wrong.
If you are using PRSice, you can do
--target ukb_chr#_v3
and we can automatically loop through the bgen files. As for QC, as you are doing UKB, there should be a mgi or mxx file (I forgot the suffix) that contain the MAF and INFO score which you can use for filtering. Though I'd recommend first running the PRS analyses with the genotype file as most of the time, imputed data does not add substantial amount of signal, but takes way longer to run.Thanks for your reply. Using --target command is a good idea. But to be clear, do I not need to merge the files then? So I have Quality controlled 22 files and all generated files 22 copies all the way PRS analysis?
PRSice will automatically loop through all 22 chromosomes, if that is what you meant. You can just provide a single SNP list containing the SNP ID of those that passed QC.
Thanks for your reply. I'm a little confused with the QC step. From PRSice manual (https://www.prsice.info/quick_start/#quality-control-of-target-samples) I came to understand, the target samples need to be quality controlled using PLINK with following command plink --bfile ($target) \ --maf 0.05 \ --mind 0.1 \ --geno 0.1 \ --hwe 1e-6 \ --make-just-bim \ --make-just-fam \ --out ($target).qc Then, --keep ($target).qc.fam --extract ($target).qc.bim can be added to the PRSice command to filter out the samples and SNPs.
Is there other alternative that you are suggesting? With UKB application, I only have bgen, bgi and sample file.
If you are using UK Biobank, the minor allele frequency, info score and all those information were already computed and can be downloaded directly from here Therefore, you don't really need to compute the data yourself.
I have downloaded the file containing MAF and INFO score ukb_imp_mfi.tgz file for filtering as you mentioned above for QC. Could you mention the command script for the filtering? Currently I'm running the following script for only PRS construction without regression -
Rscript PRSice.R --dir . ^ --prsice PRSice_win64.exe ^ --base iPSYCH-PGC_ASD_Nov2017.gz ^ --snp SNP ^ --chr CHR ^ --bp BP ^ --A1 A1 ^ --A2 A2 ^ --stat OR ^ --pvalue P ^ --target ukb_imp_chr#_v3,ukb30172_imp_chr22_v3_s487324.sample ^ --type bgen ^ --no-regress ^ --thread 1 ^ --stat OR ^ --extract PRSice.valid ^ --binary-target T ^ --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 ^ --allow-inter ^ --fastscore ^ --score avg ^ --interval 5e-05 ^ --lower 5e-08 ^ --upper 0.5 ^ --missing mean_impute ^ --model add ^ --print-snp ^ --out trial_140222
You will need to filter out the require SNP names, then use
--extract
or--exclude
from PRSice to do the filtering.How do I filter out the require SNP names?I tried with PRSice with following command:
Rscript PRSice.R --dir . ^ --prsice PRSice_win64.exe ^ --target ukb_imp_chr#_v3,ukb30172_imp_chr22_v3_s487324.sample ^ --type bgen ^ --geno 0.02 ^ --maf 0.01 ^ --info 0.7 ^ --print-snp ^ --out ukb.QC
But I get error asking for Base file. After including the base file, it seems to start working on PRS score calculation. Could you help me out with the step by step commands. As I understood, with one command i do filter out SNP names, and then then use --extract or --exclude from PRSice to do the filtering. Thanks
Use
awk
orR
on the mfi file. Identify any row with INFO score and MAF larger than desired threshold. Then you can write it out to a file. e.g.awk '$6 > 0.05 && $6 < 0.95 && $8 > 0.8{print $2}' mfi > valid.snp
Will filter out SNPs with MAF > 0.5 ($6
) and info score > 0.8 ($8
) and print out the rs id ($2
). You can find the header information here. (Note that, my copy of mfi seems to have slightly different format than what was shown in the document. So do change the column number according to the format of your mfi file)Thanks a lot. It worked. Now I have this issue with --extract command. I have three commands with --extract now; one for the files with QCed SNPs of target file, another with the list of valid SNPs after clumping and another from last PRSice run .valid list. Please see my log below followed by my query- PRSice 2.3.5 (2021-09-20) https://github.com/choishingwan/PRSice (C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly GNU General Public License v3 If you use PRSice in any published work, please cite: Choi SW, O'Reilly PF. PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data. GigaScience 8, no. 7 (July 1, 2019) 2022-02-22 13:52:04 D:\ukb_genetic_data\PRSice_win64.exe \ --a1 A1 \ --a2 A2 \ --allow-inter \ --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \ --base iPSYCH-PGC_ASD_Nov2017.gz \ --base-info INFO:0.9 \ --binary-target T \ --bp BP \ --chr CHR \ --extract combined-csv-files.snp \ --fastscore \ --geno 0.02 \ --ignore-fid \ --interval 5e-05 \ --lower 5e-08 \ --missing mean_impute \ --model add \ --no-clump \ --no-regress \ --num-auto 22 \ --or \ --out trial_220222 \ --pvalue P \ --score avg \ --seed 140070291 \ --snp SNP \ --stat OR \ --target ukb_imp_chr#_v3,ukb30172_imp_chr22_v3_s487324.sample \ --thread 1 \ --type bgen \ --upper 0.5
Initializing Genotype file: ukb_imp_chr#_v3 (bgen) With external fam file: ukb30172_imp_chr22_v3_s487324.sample
Start processing iPSYCH-PGC_ASD_Nov2017
RS_ID assume to be column containing SNP ID
Base file: iPSYCH-PGC_ASD_Nov2017.gz GZ file detected. Header of file is: CHR SNP BP A1 A2 INFO OR SE P
9112386 variant(s) observed in base file, with: 3102224 variant(s) excluded based on user input 639495 variant(s) with INFO score less than 0.900000 805870 ambiguous variant(s) excluded 4564797 total variant(s) included from base file
Loading Genotype info from target
487409 people (223006 male(s), 264318 female(s)) observed 487409 founder(s) included
Error: A total of 8166 duplicated SNP ID detected out of 4560581 input SNPs! Valid SNP ID (post --extract / --exclude, non-duplicated SNPs) stored at trial_220222.valid. You can avoid this error by using --extract trial_220222.valid
I want to generate the score again now with clumped file (snp_140222.snp) extracting the valid target snp file (combined-csv-files.snp). how can I do that. Sorry for my long post and query.
Do I use -keep for the QCed target valid.snp or -extract? You mentioned before to use -extract.
For snp, use --extract. For samples, use --keep
Can I have more than one --extract in the command? As I have clumped file (snp_140222.snp) from last run of PRS scoring and the valid target snp file (combined-csv-files.snp) after QC. I also have a .valid file to extract when I give the command. Could you check my command below -
Rscript PRSice.R --dir . ^
--prsice PRSice_win64.exe ^
--base iPSYCH-PGC_ASD_Nov2017.gz ^
--snp SNP ^
--chr CHR ^
--bp BP ^
--A1 A1 ^
--A2 A2 ^
--stat OR ^
--pvalue P ^
--target ukb_imp_chr#_v3,ukb30172_imp_chr22_v3_s487324.sample ^
--type bgen ^
--no-clump ^
--no-regress ^
--thread 1 ^
--extract PRSice.valid ^
--binary-target T ^
--bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 ^
--allow-inter ^
--fastscore ^
--score avg ^
--interval 5e-05 ^
--lower 5e-08 ^
--upper 0.5 ^
--missing mean_impute ^
--model add ^
--extract snp_140222.snp ^
--ignore-fid ^
--extract combined-csv-files.snp ^
--geno 0.02 ^
--out trial_220222
nope. You need to merge them / make one yourself
Thanks for you reply. So I make one valid snp list file by merging QCed target file snp list and clumped snp list from last run. Then --extract that file only. While merging /making, do I keep only the SNP names in that file or other information also required?
Hi Sam, Thanks for your support, I've been able to work it out. I have a question, if you could help me out. While PRScoring with BGEN imputed UKB data, I only had INFO, MAF and Geno filtering as QC. Is it enough QC steps? From your paper I see additional recommendation of 'genotyping rate >0.99' and 'Hardy-Weinberg Equilibrium P >1 × 10−6'. How to apply that while working with UKB BGEN data? And if not needed, can i get a reference to that. Thanks