Removing duplicate variants from a VCF file using position, reference base and alternate base
8.4 years ago
genie66 ▴ 30

I have a merged VCF file from several samples, and I want to remove duplicates from the master VCF based on the position, reference base and alternate base. Can someone help me out with either Python or Linux commands? Thanks.

vcf duplicate • 5.6k views
8.4 years ago
13en ▴ 90

You can sort the VCF by chromosome and position with sort -k1,1 -k2,2n, so that variants at the same position end up on adjacent lines, then remove the duplicates with uniq. Unfortunately, uniq compares whole lines, so this will probably fail if the records differ in any later field such as INFO or the sample columns. If you expect only a few duplicates, you could keep just the key columns with cut, use the -d option of uniq to print only the duplicated lines, and then resolve those by hand.
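As a rough Python counterpart to the cut plus uniq -d idea, the sketch below reports which keys occur more than once without needing the file to be sorted. The function name duplicate_keys is made up for illustration, and it assumes a tab-delimited VCF with the standard column order. It also includes CHROM in the key, which the question does not mention, on the assumption that the same position on different chromosomes should not count as a duplicate.

```python
from collections import Counter


def duplicate_keys(lines):
    """Return the (CHROM, POS, REF, ALT) keys that occur more than once.

    Hypothetical helper: assumes tab-delimited VCF records with the
    standard column order (CHROM, POS, ID, REF, ALT, ...).
    """
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[(fields[0], fields[1], fields[3], fields[4])] += 1
    # Counter keeps first-seen order, so keys come back in file order.
    return [key for key, n in counts.items() if n > 1]


# Usage (file name is a placeholder):
# with open("in.vcf") as handle:
#     for key in duplicate_keys(handle):
#         print("\t".join(key))
```

Because only the four key columns are compared, records that differ in INFO or in the sample columns are still reported as duplicates of each other.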

Edit: I forgot to mention that you'll probably want to set the header lines (those starting with #) aside before sorting. grep "^#" in.vcf > out.vcf writes them to a new file, and grep -v "^#" in.vcf gives you the records without them. Once the records are sorted, append them to out.vcf.
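The header handling can be sketched in Python as well. This is only an illustration under the same assumptions (tab-delimited records, standard column order); sort_vcf_lines is a hypothetical name. Note that chromosome names are compared as plain strings here, which may order them differently from sort under some locales.

```python
def sort_vcf_lines(lines):
    """Return header lines first, followed by records sorted by CHROM and POS.

    Hypothetical helper mirroring grep "^#" plus sort -k1,1 -k2,2n.
    """
    lines = list(lines)
    # Header/comment lines keep their original relative order.
    header = [line for line in lines if line.startswith("#")]
    body = [line for line in lines if line.strip() and not line.startswith("#")]

    def sort_key(line):
        fields = line.split("\t")
        # CHROM compared as text, POS compared as a number.
        return (fields[0], int(fields[1]))

    body.sort(key=sort_key)
    return header + body


# Usage (file names follow the in.vcf / out.vcf placeholders above):
# with open("in.vcf") as src, open("out.vcf", "w") as dst:
#     dst.writelines(sort_vcf_lines(src))
```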

Shouldn't a merged, multi-sample VCF list each variant only once anyway, with each sample in a separate column and a genotype (GT) field showing whether the variant is present in that sample? In the example below, sample 1 is homozygous wild type, sample 2 is heterozygous, and sample 3 is homozygous variant.

#CHROM POS    ID        REF  ALT     ...    FORMAT      Sample1        Sample2        Sample3
2      4370   rs6057    G    A       ...    GT:GQ:DP:HQ 0|0:48:1:52,51 1|0:48:8:51,51 1/1:43:5:...
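To make the reading of the GT values concrete, here is a small sketch that classifies a sample column by its genotype. The function name zygosity and the returned labels are invented for illustration, and it assumes GT is the first colon-separated subfield of the sample column, as in the example above.

```python
import re


def zygosity(sample_field):
    """Classify a sample column such as '1|0:48:8:51,51' by its GT subfield.

    Hypothetical helper: assumes GT is the first colon-separated subfield.
    """
    genotype = sample_field.split(":")[0]
    # Alleles are separated by "/" (unphased) or "|" (phased).
    alleles = re.split(r"[/|]", genotype)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


for column in ("0|0:48:1:52,51", "1|0:48:8:51,51", "1/1:43:5:..."):
    print(column, zygosity(column))
```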

No, I don't want to sort them; that would disturb my downstream analysis.
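If the original record order has to be preserved, one option is to remember the keys already seen and keep only the first record for each key, which needs no sorting at all. The sketch below is a minimal illustration, not a tested tool: dedupe_keep_order is a hypothetical name, the file is assumed to be tab-delimited with the standard column order, and CHROM is added to the key as an assumption. When duplicates differ in INFO or the sample columns, the first occurrence wins and the later ones are dropped.

```python
def dedupe_keep_order(lines):
    """Yield lines in their original order, dropping repeated variants.

    Hypothetical helper: a variant is identified by (CHROM, POS, REF, ALT);
    only the first record seen for each key is kept.
    """
    seen = set()
    for line in lines:
        # Header, comment and blank lines pass through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        if key not in seen:
            seen.add(key)
            yield line


# Usage (file names are placeholders):
# with open("in.vcf") as src, open("out.vcf", "w") as dst:
#     dst.writelines(dedupe_keep_order(src))
```

The set of keys is held in memory, so memory use grows with the number of distinct variants rather than with the file size.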
