I have a merged VCF file of several samples, and I want to remove duplicates from the master VCF based on the position, reference, and alternate base. Can someone help me out with either Python or Linux commands? Thanks.
You can sort the VCF by chromosome and position with sort -k1,1 -k2,2n, so variants at the same position end up together, and then remove the duplicates with uniq. Unfortunately, this will probably fail if there is any difference in later fields such as INFO or the sample columns, because uniq only treats fully identical lines as duplicates. If you think there will only be a few duplicates, you could remove those columns with cut, use the -d option of uniq to show only the duplicated lines, and then sort them out manually.
Edit: I forgot to mention that you'll probably want to remove the header lines before sorting! grep -v "^#" in.vcf will do that for you. If you run grep "^#" in.vcf > out.vcf first, it will put the header lines in a new file, and you can then append the sorted records to that file.
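Since the question also asked about Python, here is a minimal sketch of the "show only the duplicates" step (roughly what cut plus uniq -d would give you). It assumes tab-separated records with the standard CHROM, POS, ID, REF, ALT columns first; the function name duplicate_keys and the file name in.vcf are just placeholders, and including CHROM in the key is my assumption.

```python
from collections import Counter


def duplicate_keys(lines):
    """Return the (CHROM, POS, REF, ALT) keys that occur more than once."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            # Header and comment lines are never counted as variants.
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Not a complete VCF record (e.g. a blank line); ignore it.
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        counts[(chrom, pos, ref, alt)] += 1
    # Keys come back in order of first appearance in the file.
    return [key for key, n in counts.items() if n > 1]


# Usage (in.vcf is a hypothetical file name):
# with open("in.vcf") as handle:
#     for key in duplicate_keys(handle):
#         print(*key, sep="\t")
```

This only reports the duplicated keys, so you can still inspect the matching records by hand before deciding which one to keep.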
Shouldn't the merged, multi-sample VCF just have each variant once anyway, with each sample in a separate column and a genotype (GT) field showing whether the variant is present in that sample? In this example, sample 1 is hom wt, sample 2 is het, and sample 3 is hom variant.
#CHROM POS ID REF ALT ... FORMAT Sample1 Sample2 Sample3
2 4370 rs6057 G A ... GT:GQ:DP:HQ 0|0:48:1:52,51 1|0:48:8:51,51 1/1:43:5:...
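To make the genotype reading concrete, here is a rough Python sketch that pairs the FORMAT keys with one sample column and classifies the GT value. The helper names parse_sample and zygosity are made up for illustration, and it assumes simple 0/1-style allele indices.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so dropped trailing values are tolerated.
    return dict(zip(keys, values))


def zygosity(gt):
    """Classify a GT string such as 0|0, 1|0 or 1/1."""
    # Phased (|) and unphased (/) genotypes are treated the same here.
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if all(allele == "0" for allele in alleles):
        return "hom ref"
    if len(set(alleles)) == 1:
        return "hom alt"
    return "het"


sample1 = parse_sample("GT:GQ:DP:HQ", "0|0:48:1:52,51")
sample2 = parse_sample("GT:GQ:DP:HQ", "1|0:48:8:51,51")
print(zygosity(sample1["GT"]), zygosity(sample2["GT"]), zygosity("1/1"))
```

Here "hom ref" corresponds to hom wt in the example above, and "hom alt" to hom variant.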
No, I don't want to sort them; that would disturb my further analysis.
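If sorting is not an option, one possible approach is a single pass in Python that remembers the keys already seen and keeps only the first record for each. This is just a sketch under some assumptions: records are tab-separated, the key is (CHROM, POS, REF, ALT), and dedup_vcf_lines plus the file names are placeholder names.

```python
def dedup_vcf_lines(lines):
    """Yield lines in their original order, dropping repeated variants."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            # Header lines pass through untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t", 5)
        if len(fields) < 5:
            # Not a complete record; pass it through rather than guess.
            yield line
            continue
        # Columns: CHROM, POS, ID, REF, ALT (ID is not part of the key).
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue  # a record with this key was already written
        seen.add(key)
        yield line


# Usage (file names are hypothetical):
# with open("in.vcf") as src, open("dedup.vcf", "w") as dst:
#     dst.writelines(dedup_vcf_lines(src))
```

Note that this keeps whichever record comes first and silently drops later ones, even if their INFO or sample columns differ, so it may be worth checking the duplicates before relying on it.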