If you are a Python user, you may want to check out the pyvcf submodule I wrote (part of the fuc package). If the duplicate rows are truly duplicates, in the sense that all fields are identical, you can clean up your VCF file this way:
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['chr1', 'chr2', 'chr2'],
... 'POS': [100, 101, 101],
... 'ID': ['.', '.', '.'],
... 'REF': ['G', 'T', 'T'],
... 'ALT': ['A', 'C', 'C'],
... 'QUAL': ['.', '.', '.'],
... 'FILTER': ['.', '.', '.'],
... 'INFO': ['.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT'],
... 'Steven': ['0/1', '1/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # To load your own file instead of the toy data above:
>>> # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1
2  chr2  101  .   T   C    .      .    .     GT    1/1
>>> # Drop rows that are identical in every column, then rebuild the VcfFrame.
>>> df = vf.df.drop_duplicates()
>>> filtered_vf = pyvcf.VcfFrame([], df)
>>> filtered_vf.to_file('filtered.vcf')
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1
If your duplicate rows have the same POS, REF, ALT, ... but slightly different genotype data for some reason, the above will not work. Let me know if this is the case.
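For that second case, one possible approach is to deduplicate on the variant-defining columns only. This is a minimal sketch using plain pandas on toy data, not a pyvcf feature. The choice of key columns and of keeping the first row per variant are assumptions, and whether discarding the other genotype is acceptable depends on your data.

```python
import pandas as pd

# Toy records: rows 1 and 2 share CHROM/POS/REF/ALT but differ in genotype.
df = pd.DataFrame({
    'CHROM': ['chr1', 'chr2', 'chr2'],
    'POS': [100, 101, 101],
    'REF': ['G', 'T', 'T'],
    'ALT': ['A', 'C', 'C'],
    'Steven': ['0/1', '1/1', '0/1'],
})

# Columns that define "the same variant" (an assumption; adjust as needed).
key = ['CHROM', 'POS', 'REF', 'ALT']

# Every row that belongs to a duplicated variant, for manual inspection.
conflicts = df[df.duplicated(subset=key, keep=False)]

# Keep only the first row seen for each variant.
deduped = df.drop_duplicates(subset=key, keep='first')
```

It is probably worth looking at `conflicts` before trusting `deduped`, since the rows being dropped carry different genotype calls.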
Do not open VCF files in Excel. Instead, get those two lines on the command line and use uniq to check whether they have the same content. If not, you have a lead for finding the cause of the difference. If they have exactly the same content, an expert on VarDict would be in a better position to answer your question.
Hello Ram, thank you for your suggestion. The two lines have the same content. Somehow, using the --deldupvar option removes the duplicate, but neither the -t option nor -F 0x504, both of which should remove duplicates, works. If you know the difference between these options, please do let me know. Thanks.
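For anyone who prefers to do the exact-content check suggested earlier in Python rather than with uniq, here is a rough sketch using only the standard library. The function name `exact_duplicates` and the sample lines are made up for illustration. It reports data lines that occur more than once character for character, and unlike uniq it does not require the duplicates to be adjacent.

```python
from collections import Counter


def exact_duplicates(lines):
    """Return data lines that occur more than once, character for character."""
    counts = Counter(
        line.rstrip('\n')
        for line in lines
        if not line.startswith('#')  # skip VCF header lines
    )
    return [line for line, n in counts.items() if n > 1]


# Hypothetical VCF content; in practice, pass an open file handle instead.
sample = [
    '##fileformat=VCFv4.2\n',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSteven\n',
    'chr1\t100\t.\tG\tA\t.\t.\t.\tGT\t0/1\n',
    'chr2\t101\t.\tT\tC\t.\t.\t.\tGT\t1/1\n',
    'chr2\t101\t.\tT\tC\t.\t.\t.\tGT\t1/1\n',
]

dups = exact_duplicates(sample)
```

If a pair of rows that looks identical does not show up in `dups`, the two lines differ somewhere, which is the lead to follow.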