How to remove positions from a VCF based on rsID
2.8 years ago
raalsuwaidi ▴ 100

Hi all,

I am trying to remove entries from a VCF file based on rsID. I tried bcftools view, but the list of positions to exclude that I prepared did not work.

A sample of the file looked like this:

22 rs165886
22 rs165608
22 rs1541529
22 rs4819925
22 rs5992604

and the bcftools command is the following:

bcftools view -T ^tlist.txt input.vcf

which always gives the error "Could not parse the file".

So I tried changing the value in the plink map file to -1, as shown below, and then recoded it to VCF with plink. Even after all of that, the positions are still in the VCF file.

22  rs165886    -1  17339003
22  rs165608    -1  17339404

Can you please tell me how to fix this?

vcf plink bcftools

Do an inverse grep


You mean an inverse grep on the VCF file? I am no expert in that, can you please give me an example? Will it remove the whole line?


With an example VCF and an example file containing one rsID:

$ cat test.vcf 
##fileformat=VCFv4.1
##filedate=2017.7.5
##source=Minimac3
##contig=<ID=29>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1242658141  1364665948  1242658615
10  96633300    rs4919045   G   A   .   PASS    .   GT:DS   0|0:0.193   0|0:0.193   0|0:0.193
29  11  Chr29:11    A   G   .   PASS    .   GT:DS   0|0:0.193   0|0:0.193   0|0:0.193

$ cat test.txt 
rs4919045

$ grep -vf test.txt test.vcf

##fileformat=VCFv4.1
##filedate=2017.7.5
##source=Minimac3
##contig=<ID=29>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1242658141  1364665948  1242658615
29  11  Chr29:11    A   G   .   PASS    .   GT:DS   0|0:0.193   0|0:0.193   0|0:0.193

If the VCF is gzipped, you can use zgrep instead of regular grep.
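If the filtered output should stay compressed as well, one option is to decompress on the fly, filter, and recompress. The following is only a sketch: all file names (demo.vcf.gz, demo_ids.txt, demo.filtered.vcf.gz) are made up for the demonstration, and it uses the -F and -w flags so that each rsID is matched as a whole word.

```shell
# Build a tiny gzipped VCF and a one-column rsID list (hypothetical demo files).
printf '##fileformat=VCFv4.1\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf '10\t96633300\trs4919045\tG\tA\t.\tPASS\t.\n' >> demo.vcf
printf '29\t11\tChr29:11\tA\tG\t.\tPASS\t.\n' >> demo.vcf
gzip -f demo.vcf                      # produces demo.vcf.gz
printf 'rs4919045\n' > demo_ids.txt

# Decompress to stdout, drop lines containing a listed rsID, recompress.
gzip -dc demo.vcf.gz | grep -vFwf demo_ids.txt | gzip > demo.filtered.vcf.gz

# Inspect the result: the header lines and the Chr29:11 record remain.
gzip -dc demo.filtered.vcf.gz
```

Note that plain gzip output is not BGZF; if the result needs to be indexed with tabix or bcftools, it would have to be compressed with bgzip instead.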

You can also try this; I haven't tested its performance:

$ awk -F "\t" 'FNR==NR{a[$1]++;next}!a[$3]' test.txt test.vcf
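The one-liner above can be unpacked as follows. This is a commented sketch of the same idea, not a tested replacement: it uses `in` for the lookup so the array is not modified while the VCF is read, and the demo file names (awk_ids.txt, awk_demo.vcf, awk_filtered.vcf) are placeholders.

```shell
# Hypothetical demo inputs: one rsID to drop and a two-record VCF.
printf 'rs4919045\n' > awk_ids.txt
printf '##fileformat=VCFv4.1\n' > awk_demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> awk_demo.vcf
printf '10\t96633300\trs4919045\tG\tA\t.\tPASS\t.\n' >> awk_demo.vcf
printf '29\t11\tChr29:11\tA\tG\t.\tPASS\t.\n' >> awk_demo.vcf

awk -F '\t' '
    # First file (FNR == NR): remember every rsID, then skip to the next line.
    FNR == NR { drop[$1]; next }
    # Second file: print the line unless column 3 (ID) is in the list.
    # Header lines pass through because their third field is not a listed rsID.
    !($3 in drop)
' awk_ids.txt awk_demo.vcf > awk_filtered.vcf

cat awk_filtered.vcf
```

Unlike grep, this compares only the ID column, so an rsID that happens to appear elsewhere on a line does not remove it. Two caveats: a blank line in the ID list adds an empty key, which would also drop the ## header lines, and records whose ID column holds several IDs separated by semicolons are not matched by this exact comparison.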
A stricter variant of the grep command, treating each rsID as a fixed string (-F) and matching whole words only (-w):

grep -vFwf test.txt test.vcf
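To see why -F and -w matter, here is a small sketch with made-up IDs (rs123 and rs1234 are hypothetical, as are the file names). Without -w, the pattern rs123 also matches the rs1234 line, so both records are removed.

```shell
# Hypothetical two-record VCF body; rs123 is a prefix of rs1234.
printf '1\t100\trs123\tA\tG\t.\tPASS\t.\n'  > demo_body.vcf
printf '1\t200\trs1234\tC\tT\t.\tPASS\t.\n' >> demo_body.vcf
printf 'rs123\n' > demo_exclude.txt

# Substring matching: both lines contain "rs123", so both are dropped.
# grep exits non-zero when it prints nothing, hence the "|| true".
grep -vf demo_exclude.txt demo_body.vcf > loose.out || true

# -F: patterns are fixed strings, -w: match whole words only.
# Only the rs123 record is dropped; the rs1234 record is kept.
grep -vFwf demo_exclude.txt demo_body.vcf > strict.out

wc -l loose.out strict.out
```

Even with -w, grep looks at the whole line rather than just the ID column, so an rsID that appears as a separate word elsewhere on a line would also remove that line.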

Thanks. That worked

2.8 years ago
bcftools view -e 'ID=@list_of_rs_id.txt'  in.vcf
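The `ID=@file` form reads one ID per line, so the two-column list from the question (chromosome, rsID) would first need to be reduced to its second column. The -T option used in the question expects chromosome and position columns rather than rsIDs, which is the likely reason the file could not be parsed. A minimal sketch, assuming the list is whitespace-separated; tlist.txt and input.vcf are the names from the question, while ids.txt and out.vcf are placeholder names:

```shell
# Recreate the two-column list shown in the question (chromosome, rsID).
# Skip this line if tlist.txt already exists.
printf '22 rs165886\n22 rs165608\n22 rs1541529\n22 rs4819925\n22 rs5992604\n' > tlist.txt

# Keep only the rsID column, one ID per line.
awk '{print $2}' tlist.txt > ids.txt

# Exclude records whose ID is listed in ids.txt. This step runs only if
# bcftools is installed and input.vcf exists; out.vcf is a placeholder name.
if command -v bcftools >/dev/null 2>&1 && [ -f input.vcf ]; then
    bcftools view -e 'ID=@ids.txt' -o out.vcf input.vcf
fi
```

Because the filter is applied to the ID column by bcftools itself, header lines are preserved and the output remains a valid VCF.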