Question

Filtering Imputation output: how to filter on a VCF INFO field

5

Entering edit mode

8.7 years ago

mmagoo6 ▴ 80

I used the Michigan Imputation Server to impute data that I have, and got three files per chromosome as output: .dose.vcf.gz, .dose.vcf.gz.tbi, and .info.gz. I only want to keep genotypes that are imputed with an R2 that is greater than 0.3. The .dose.vcf.gz has everything (I think) that I need to create a plink output file with only the high quality genotypes, but I am having a hard time getting vcftools to understand what I am asking it to do.

vcftools --vcf chr1.dose.vcf.gz --filter MinR2=.30 --plink --out plink_chr1
Error: Unknown option: --filter

Can someone please help me figure out how to filter on the value of an INFO field?

I have looked at the vcftools documentation and examples, and haven't yet figured out how to filter on INFO. The info filtering mentioned here only filters if the field exists at all, not for particular values of it. Any suggestions would be appreciated. is VCFtools not the correct tool for this?

Here is the header of the .dose.vcf.gz:

##fileformat=VCFv4.1
##filedate=2016.8.6
##source=Minimac3
##contig=<ID=1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">

imputation filtering minimac3 plink vcftools • 19k views

ADD COMMENT • link updated 6.7 years ago by natalia.han • 0 • written 8.7 years ago by mmagoo6 ▴ 80

0

Entering edit mode

Filtering Vcf File

ADD REPLY • link 8.7 years ago by microfuge ★ 2.0k

score 3 · Answer 1 · 2016-08-10

3

Entering edit mode

8.7 years ago

mmagoo6 ▴ 80

OK, so I figured it out with plink. I removed all multi-allelic variants, and variants with the same name (which might not be ideal for everyone), but in short, I went from the michigan server output to a single plink file of everything with R2>0.3 with the help of this hack from Pontikos' github in this way:

#sign in to the michigan server on chrome, go to the "job" section and pick the job of interest
#Go to the "results" tab
#In the chrome development "console" type  copy(document.body.innerHTML);
#in nano paste that information into download_page.html
#paste the one-time-use password into password.txt
python extract.py >urls.txt
for x in `cat urls.txt`; do wget $x; done;
password=`cat password.txt`
for x in *.zip
do
    echo unzip -P $password -o $x
    unzip -P $password -o $x
done
for chnum in {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22};
  do
  gunzip chr$chnum.info.gz
  plink --vcf chr$chnum.dose.vcf.gz --make-bed --out s1
  plink --bfile s1 --bmerge s1 --merge-mode 6
  plink --bfile s1 --exclude plink.missnp --make-bed --out s2
  plink --bfile s2 --list-duplicate-vars
  plink --bfile s2 --exclude plink.dupvar --make-bed --out s3
  plink --bfile s3 --qual-scores chr$chnum.info 7 1 1 --qual-threshold 0.3 --make-bed --out ../plinkout/chr$chnum
done
#create file merge.list that contains the directory of each chromosome
plink --bfile ../plinkout/1 merge-list merge.list --make-bed --out imputepostqc

ADD COMMENT • link 8.7 years ago by mmagoo6 ▴ 80

3

Entering edit mode

Hi, thank you for your post. It helps me a lot with filtering dose.vcf.gz and converting them to plink format. However, I found the filtered files have far less number of rows than the files filtered by:

bcftools view -i 'R2>.8 & MAF>.05' -Oz chrnum.dose.vcf.gz > chrnum.filtered.vcf.gz

And I think converting the dose.vcf.gz to plink in this way leads to the lost the dosage information. Maybe need to use software like DosageConvertor to convert dose.vcf.gz to plink dosage files.

ADD REPLY • link 6.7 years ago by Sunshine n Rain ▴ 40

0

Entering edit mode

Thanks for this comment @Sunshine n Rain, it's nice and easy to filter out SNPs from vcf files according to MAF and R2 values using this command!!

ADD REPLY • link 5.9 years ago by rodd ▴ 250

0

Entering edit mode

its also possible to filter the vcf.gz file directly using bcftools filter -e

ADD REPLY • link 8.7 years ago by vakul.mohanty ▴ 270

0

Entering edit mode

Bcftools' filter removes sites, not individual genotypes.

ADD REPLY • link 8.4 years ago by luisrmacias • 0

0

Entering edit mode

Do you have any recommendations for the filtering value of GP

##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1).

I use no 0.85 however not sure how scientific that value is... Any ideas or suggestions?

ADD REPLY • link 8.5 years ago by Floris Brenk ★ 1.0k

0

Entering edit mode

There is an error in the --qual-threshold step. Error: Duplicate variant '1:871334' in --qual-scores file

ADD REPLY • link 7.0 years ago by kvshamsudheen ▴ 120

score 1 · Answer 2 · 2016-09-21

1

Entering edit mode

8.6 years ago

fengjunjiao54 ▴ 10

Thank you for your sharing. I have try Pontikos' github you pasted. however, I wander what is the extract.py ,is there another python script? for I get the error: python extract.py >urls.txt syntaxEror : invalid syntax. any suggestions will be helpful for me .

Thank you !

ADD COMMENT • link 8.6 years ago by fengjunjiao54 ▴ 10

0

Entering edit mode

The extract.py seems like a script to extract the download urls. You can ignore that line and take what else you need.

ADD REPLY • link 8.4 years ago by Tao ▴ 540

score 0 · Answer 3 · 2016-12-11

0

Entering edit mode

8.4 years ago

Tao ▴ 540

Hi, This post is very helpful for me as I am also looking for protocols for QC imputation results. Besides the QC steps you mentioned above, I wonder if you add other QC steps? If yes, please kindly tell me. Thanks!

Tao

ADD COMMENT • link 8.4 years ago by Tao ▴ 540

score 0 · Answer 4 · 2017-12-01

0

Entering edit mode

7.4 years ago

shraddha.pai • 0

Pontiko's code snippet is very useful. In the gist above, what is this line achieving? Why would you want to merge a file with itself? plink --bfile s1 --bmerge s1 --merge-mode 6

ADD COMMENT • link 7.4 years ago by shraddha.pai • 0

0

Entering edit mode

That is for detecting non-biallelic variants. For example, one SNP might has 3 alleles: A -> C, A -> G. And the this step will report such SNPs as mis-match SNPs.

ADD REPLY • link 7.4 years ago by Tao ▴ 540

score 0 · Answer 5 · 2018-08-22

0

Entering edit mode

6.7 years ago

natalia.han • 0

I believe you can also do this in order to only get biallelic variants:

  plink --vcf chr$chnum.dose.vcf.gz --make-bed --double-id --biallelic-only  --out s1

ADD COMMENT • link 6.7 years ago by natalia.han • 0