How can I remove duplicated variants from a VCF file?
7.4 years ago
kk.mahsa ▴ 150

How can I remove duplicated variants from a VCF file? I googled and searched the Biostars history, but I did not find a way to do it.


Take the 1000 Genomes phase 3 data as an example:

bcftools norm -d both --threads=32 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -O z  -o chr1.vcf.gz

Terrible, I still have duplicates after running bcftools norm:

Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.
Error: Duplicate ID '.'.

Could you please explain what you mean by duplicated variants? Do you observe two lines in your VCF file that are exactly the same?


Yes, that is what I mean: I want to keep one of the duplicate variants and remove the rest. In fact, I intend to remove variants that share the same scaffold ID and POS, keeping one of them.
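
Before removing anything, it can help to see which positions are affected. A minimal sketch, assuming an uncompressed VCF (the file names check.vcf and dup_positions.txt and the records are made up for illustration), that lists every CHROM/POS pair occurring more than once:

```shell
# Build a tiny, hypothetical VCF: chr1:100 appears twice.
printf '##fileformat=VCFv4.2\n' > check.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> check.vcf
printf 'chr1\t100\t.\tA\tT\t.\t.\t.\n' >> check.vcf
printf 'chr1\t100\t.\tA\tC\t.\t.\t.\n' >> check.vcf
printf 'chr1\t200\t.\tG\tA\t.\t.\t.\n' >> check.vcf
printf 'chr2\t100\t.\tC\tG\t.\t.\t.\n' >> check.vcf

# Drop header lines, keep CHROM and POS (cut splits on tabs by default),
# sort, and let uniq -d print one copy of each pair that is repeated.
grep -v '^#' check.vcf | cut -f1,2 | LC_ALL=C sort | uniq -d > dup_positions.txt

# dup_positions.txt now holds one line per duplicated CHROM/POS pair.
cat dup_positions.txt
```

An empty dup_positions.txt would mean no two records share the same CHROM and POS.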

7.4 years ago

In fact, I intend to remove variants that share the same scaffold ID and POS, keeping one of them.

I strongly suggest you also use the REF information...

Sort on CHROM/POS/REF. Then, using awk, create a key KEY=CHROM\tPOS\tREF and print the line only if the key differs from the one on the previous line.

LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4  input.vcf |\
awk -F '\t' '/^#/ {print;prev="";next;} {key=sprintf("%s\t%s\t%s",$1,$2,$4);if(key==prev) next;print;prev=key;}'

edit: added 'next; ' for VCF header.


Thanks Pierre for your answer. I ran your command and got a VCF file as output, but when I used bcftools stats I got this error:

Failed to open output.vcf: unknown file type

Why can't bcftools recognize the output as a VCF file? I need the output as a VCF file for downstream analysis.


Ah yes, sorry: it's because sort messed up the VCF header, so ##fileformat= is no longer the first line.

please try:

( grep  '^#' input.vcf ; grep -v "^#" input.vcf | LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4 | awk -F '\t' 'BEGIN{ prev="";} {key=sprintf("%s\t%s\t%s",$1,$2,$4);if(key==prev) next;print;prev=key;}' )  > out.vcf
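
To try the header-preserving approach on a small scale, here is a sketch on a made-up VCF (demo.vcf, demo.dedup.vcf and the records are hypothetical). It builds the tab with printf so that it also runs in plain sh, where $'\t' may not be available. Which of the duplicated records survives depends on the sort order, so this is only suitable when any one of them is acceptable.

```shell
# Build a tiny, hypothetical VCF. Two records share CHROM, POS and REF
# (chr1:100 A); a third record at chr1:100 has a different REF (AT).
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf 'chr1\t200\t.\tG\tA\t.\t.\t.\n' >> demo.vcf
printf 'chr1\t100\t.\tA\tT\t.\t.\t.\n' >> demo.vcf
printf 'chr1\t100\t.\tA\tC\t.\t.\t.\n' >> demo.vcf
printf 'chr1\t100\t.\tAT\tA\t.\t.\t.\n' >> demo.vcf

TAB=$(printf '\t')
{
  # Header lines go first, untouched, so ##fileformat stays on line 1.
  grep '^#' demo.vcf
  # Records are sorted on CHROM, POS (numeric) and REF; awk then skips
  # any record whose CHROM/POS/REF key equals that of the previous one.
  grep -v '^#' demo.vcf |
    LC_ALL=C sort -t "$TAB" -k1,1 -k2,2n -k4,4 |
    awk -F '\t' '{ key = $1 "\t" $2 "\t" $4; if (key == prev) next; print; prev = key }'
} > demo.dedup.vcf

cat demo.dedup.vcf
```

Because REF is part of the key, the chr1:100 record with REF=AT is kept alongside one of the two REF=A records.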

Your answer was really helpful, thank you so much Pierre. It worked.

7.4 years ago

Use vcfuniq or bcftools norm (with the -d option) to remove duplicates.


bcftools norm left-align and normalize indels

Yes. It left-aligns the alleles and then, if the start coordinate is the same, removes one of them, right?


Thank you capd0112, I used bcftools norm and it worked.


bcftools norm is new to me, thanks!

4.3 years ago

More options (just adding to keep threads linked based on common information): A: Remove duplicate SNPs only based on SNP ID in bcftools

Kevin

3.0 years ago

Do you know how to remove duplicated variants from a VCF file (ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz)? I am struggling with it.
