Question

Removing duplicate variants from a VCF file using position, reference base and alternate base

8.4 years ago

genie66 ▴ 30

I have a merged VCF file of several samples and I want to remove duplicates from the master VCF based on the position, reference base and alternate base. Can someone help me out with either Python or Linux commands? Thanks

vcf duplicate • 5.6k views

updated 8.4 years ago by 13en ▴ 90 • written 8.4 years ago by genie66 ▴ 30

score 1 · Answer 1 · 2016-07-07

You can sort the VCF by chromosome and position with sort -k1,1 -k2,2n, so variants at the same position end up on adjacent lines, then remove the duplicates with uniq. Unfortunately, uniq compares whole lines, so this will probably fail if later fields such as INFO or the sample columns differ between otherwise identical variants. If you think there will only be a few duplicates, you could strip those columns with cut, use uniq -d to show only the duplicated lines, and then sort those out manually.

Edit: I forgot to mention that you'll probably want to remove the header lines (those starting with #) before sorting! grep -v "^#" will do that for you. If you run grep "^#" in.vcf > out.vcf first, the header lines go into a new file, and you can then append the sorted, de-duplicated records to it with >>.
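Since the question also asked about Python, here is a minimal sketch of the same idea that avoids the whole-line problem with uniq by comparing only a key built from selected columns. The function name dedup_vcf is made up for illustration. It assumes well-formed, tab-separated records, and it adds CHROM to the key (the question mentions only position, REF and ALT) because the same position can occur on different chromosomes. Only the first record for each key is kept, so any differing INFO or sample data in later duplicates is discarded.

```python
def dedup_vcf(lines):
    """Yield header lines unchanged, plus the first record seen for each
    (CHROM, POS, REF, ALT) key. Later records with the same key are dropped."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            # Meta-information and column header lines pass straight through.
            yield line
            continue
        if not line.strip():
            # Skip blank lines.
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: 0=CHROM, 1=POS, 2=ID, 3=REF, 4=ALT
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        yield line
```

To use it on a file, open the input and output and pass the input handle in, for example dst.writelines(dedup_vcf(src)). Because the header is passed through and the input order is preserved, no separate grep or sort step is needed.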

Shouldn't a merged, multi-sample VCF list each variant only once anyway, with each sample in a separate column and a genotype (GT) field showing whether the variant is present in that sample? In the example below, Sample1 is hom wt (0|0), Sample2 is het (1|0), and Sample3 is hom variant (1/1).

#CHROM POS    ID        REF  ALT     ...    FORMAT      Sample1        Sample2        Sample3
2      4370   rs6057    G    A       ...    GT:GQ:DP:HQ 0|0:48:1:52,51 1|0:48:8:51,51 1/1:43:5:...
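As a rough illustration of how the sample columns in the example line above are read, the sketch below pulls the GT sub-field out of each sample column and labels it. The function name classify_gt is hypothetical. It assumes GT is the first colon-separated sub-field, as in the FORMAT string shown, and it treats any genotype containing "." as missing.

```python
import re

def classify_gt(sample_field):
    """Label a sample column by its GT sub-field (assumed to come first)."""
    gt = sample_field.split(":")[0]
    # Alleles are separated by "|" (phased) or "/" (unphased).
    alleles = re.split(r"[|/]", gt)
    if "." in alleles:
        return "missing"
    if all(a == "0" for a in alleles):
        return "hom ref"   # "hom wt" in the wording above
    if len(set(alleles)) == 1:
        return "hom alt"   # "hom variant" in the wording above
    return "het"

# Sample columns copied from the example line; the last one is truncated there.
samples = {
    "Sample1": "0|0:48:1:52,51",
    "Sample2": "1|0:48:8:51,51",
    "Sample3": "1/1:43:5:...",
}
for name, field in samples.items():
    print(name, classify_gt(field))
```

A variant is present in a sample whenever the label is "het" or "hom alt", so a properly merged file records which samples carry it without needing one line per sample.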