Duplicate results in vardict vcf
3.5 years ago
ww22runner ▴ 60

Hello everyone,

I am using Vardict to look for SNVs in an unpaired sample and this is the command I am using:

$VARDICT -G $REF_GENOME_b37 -f 0.03 -N "my_sample" -b "${SAMPLE_PFX}_UN.bam" -r 5 -z 1 -t -c 1 -S 2 -E 3 -g 4 $BEDFILE|sed '1d'|teststrandbias.R|var2vcf_valid.pl -N "unpaired" -f 0.03 -v 5 > ${SAMPLE_PFX}_vardict.vcf

However, whether I use the -t flag or -F 0x500 flag, I keep getting duplicate result rows in my vcf:

[image: screenshot of the two duplicate VCF rows]

Both rows are identical, and I cannot understand why the duplicate was not filtered. Any advice would be greatly appreciated!

Thank you


Do not open VCF files in Excel. Instead, extract those two lines on the command line and use uniq to check whether they have the same content. If they do not, you have a lead for finding the cause of the difference. If they have exactly the same content, an expert on VarDict would be in a better position to answer your question.
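If you would rather do this check in Python than with uniq, here is a minimal sketch using only the standard library. The function name `find_duplicate_records` and the example records are made up for illustration; it treats two records as duplicates only when the whole line matches byte for byte.

```python
from collections import Counter


def find_duplicate_records(lines):
    """Return {record: count} for VCF data lines that occur more than once.

    Header lines (starting with '#') and blank lines are ignored.
    """
    records = (
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.startswith("#")
    )
    counts = Counter(records)
    return {record: n for record, n in counts.items() if n > 1}


# Hypothetical example data; with a real file, pass an open file handle instead.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tG\tA\t.\t.\t.\n",
    "chr2\t101\t.\tT\tC\t.\t.\t.\n",
    "chr2\t101\t.\tT\tC\t.\t.\t.\n",
]

duplicates = find_duplicate_records(example)
for record, n in duplicates.items():
    print(n, record)
```

An empty result means no two data lines are exactly identical, so any apparent duplicates differ in at least one field.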


Hello Ram, thank you for your suggestion. The two rows have the same content. Somehow, using the --deldupvar option removes the duplicate, but the -t option and -F 0x504, both of which should remove duplicates, do not. If you know the difference between these options, please let me know. Thanks.
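One thing that may help when comparing these options: assuming -F takes a SAM flag bitmask in the samtools sense (an assumption here, not something confirmed in this thread), the two values mentioned can be decoded with a short sketch. The bit meanings come from the SAM specification; the helper name `decode_flag` is made up for illustration.

```python
# SAM flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails QC",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(mask):
    """List the names of the SAM flag bits that are set in mask."""
    return [name for bit, name in SAM_FLAGS.items() if mask & bit]


print(hex(0x500), decode_flag(0x500))
print(hex(0x504), decode_flag(0x504))
```

All of these bits describe reads in the BAM file. Under that assumption, a filter on them acts on duplicate reads rather than on duplicate rows in the output, which might be why only --deldupvar changes the VCF.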

3.5 years ago
sbstevenlee ▴ 480

If you are a Python user, you may want to check out the pyvcf submodule I wrote. If the duplicate rows are truly duplicate in the sense that all the fields are identical, you can clean up your VCF file this way:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2', 'chr2'],
...     'POS': [100, 101, 101],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'C'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Steven': ['0/1', '1/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1
2  chr2  101  .   T   C    .      .    .     GT    1/1
>>> df = vf.df.drop_duplicates()
>>> filtered_vf = pyvcf.VcfFrame([], df)
>>> filtered_vf.to_file('filtered.vcf')
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1

If your duplicate rows have the same POS, REF, ALT, and so on, but slightly different genotype data for some reason, the above will not work. Let me know if this is the case.
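For that second case, a minimal sketch in plain Python (not part of pyvcf) is to key each record on CHROM, POS, REF and ALT and keep the first one seen. The function name `dedupe_by_site` is hypothetical, and keeping the first record is an arbitrary choice; you may prefer to inspect the conflicting rows before discarding any.

```python
def dedupe_by_site(lines):
    """Keep header lines and the first record per (CHROM, POS, REF, ALT)."""
    seen = set()
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, ...
        key = (fields[0], fields[1], fields[3], fields[4])
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Hypothetical records: same site on chr2, but different genotypes.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSteven\n",
    "chr1\t100\t.\tG\tA\t.\t.\t.\tGT\t0/1\n",
    "chr2\t101\t.\tT\tC\t.\t.\t.\tGT\t1/1\n",
    "chr2\t101\t.\tT\tC\t.\t.\t.\tGT\t0/1\n",
]

deduped = dedupe_by_site(example)
print("".join(deduped), end="")
```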


Hi sbstevenlee,

Thank you, I shall try this out.
