GATK Output: Duplication on Same Position
11.2 years ago
khikho ▴ 100

Is there any explanation for having these two lines at the same position? And is there any way to automatically pick the better of the two in this case?

21      26039812        .       G       .       90.19   .       GT:DP:GQ:PL:A:C:G:T:IR  0/0:20:60.20:0,60,807:0,0:0,0:14,6:0,0:8
21      26039812        .       GATAT   G       384.15  .      GT:DP:GQ:PL:A:C:G:T:IR  1/1:20:27.09:426,27,0:0,0:0,0:14,6:0,0:8

Thank you in advance.

gatk vcf • 3.0k views

Well, as far as I can see, the first line is 0/0 and there is no ALT allele, so it's not a variant...


Pierre is correct. The first record is not a variant. If both records had been variants, you should go with the one that has the highest QUAL score (384.15, i.e. the second record, in this case). Variant callers often call a SNP and a short indel at the same position; the variant quality score can be used to select one of them.
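To make the "keep the highest score" idea concrete, here is a minimal Python sketch. The function name `best_record_per_position` is made up for illustration and is not part of GATK or any VCF library. It assumes the records are whitespace-separated with QUAL in the sixth column, as in the lines posted in the question, and it treats a missing QUAL (".") as the lowest possible score.

```python
def best_record_per_position(vcf_lines):
    """Keep, for each (CHROM, POS), the record with the highest QUAL.

    Hypothetical helper: header lines (starting with '#') are passed through
    unchanged, and records come back in order of first appearance.
    """
    header = []
    best = {}  # (chrom, pos) -> (score, original line); dicts keep insertion order
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        # split() tolerates both tabs and the space-aligned columns of the post
        fields = line.split()
        chrom, pos, qual = fields[0], int(fields[1]), fields[5]
        score = float(qual) if qual != "." else float("-inf")
        key = (chrom, pos)
        if key not in best or score > best[key][0]:
            best[key] = (score, line)
    return header + [line for _, line in best.values()]


# The two records from the question
question_records = [
    "21 26039812 . G . 90.19 . GT:DP:GQ:PL:A:C:G:T:IR 0/0:20:60.20:0,60,807:0,0:0,0:14,6:0,0:8",
    "21 26039812 . GATAT G 384.15 . GT:DP:GQ:PL:A:C:G:T:IR 1/1:20:27.09:426,27,0:0,0:0,0:14,6:0,0:8",
]

for record in best_record_per_position(question_records):
    print(record)
```

Note that this only picks by score; it does not check whether the two records tell a consistent story about the site.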

11.2 years ago
Erik Garrison ★ 2.4k

You can filter out variants that are not found in any sample in your data set using vcffixup and vcffilter:

[vcf stream] | vcffixup - | vcffilter -f "AC > 0"
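If vcflib is not at hand, roughly the same idea (drop records in which no sample carries a non-reference allele) can be sketched in plain Python. This is not how vcffixup or vcffilter are implemented; `alt_allele_count` is a hypothetical helper. It assumes the FORMAT column is the first column from the eighth onward that starts with GT, which also copes with the records in the question, where one of the usual fixed columns appears to be missing from the paste.

```python
import re


def alt_allele_count(record):
    """Count non-reference alleles across all sample genotypes of one record."""
    fields = record.split()
    # Locate FORMAT: GT, when present, is the first key of the FORMAT column.
    fmt_idx = next(
        (i for i, f in enumerate(fields)
         if i >= 7 and (f == "GT" or f.startswith("GT:"))),
        None,
    )
    if fmt_idx is None:
        return 0  # no genotypes, so nothing to count
    count = 0
    for sample in fields[fmt_idx + 1:]:
        genotype = sample.split(":")[0]
        # Alleles are separated by '/' (unphased) or '|' (phased)
        for allele in re.split(r"[/|]", genotype):
            if allele not in (".", "0"):
                count += 1
    return count


# The two records from the question
question_records = [
    "21 26039812 . G . 90.19 . GT:DP:GQ:PL:A:C:G:T:IR 0/0:20:60.20:0,60,807:0,0:0,0:14,6:0,0:8",
    "21 26039812 . GATAT G 384.15 . GT:DP:GQ:PL:A:C:G:T:IR 1/1:20:27.09:426,27,0:0,0:0,0:14,6:0,0:8",
]

# Keep only records with at least one non-reference allele
variants_only = [r for r in question_records if alt_allele_count(r) > 0]
for record in variants_only:
    print(record)
```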

However, there is a deeper problem with the example you posted. It represents an impossible picture of the variation at the locus. Is the sample homozygous reference or does it have a homozygous deletion at the locus? I suggest you figure out what is meant by the overlapping reference call before simply picking the best one.

This ambiguity presents basic problems for interpretation. If removing such ambiguity from your calls is important to your research, then I suggest you try out a haplotype-based method like freebayes or platypus. A number of de novo assembly methods will also correctly provide this information.

11.2 years ago
vdauwera ★ 1.2k

This looks like it was generated using the GATK's UnifiedGenotyper "emit all sites" mode. The first record is the ref call indicating there is no SNP at that site. The second record is an indel call. They are different calls, hence different records. If you don't want this to happen, don't use "emit all sites". Or use the GATK's newer caller, called HaplotypeCaller, which is haplotype-based as the name implies.
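To see why the two lines are different kinds of calls, it can help to label each record by type. The following is only an illustrative Python sketch; `classify_record` is a hypothetical name, and the rules are a simplification: ALT of "." is treated as a reference-site record, and any ALT allele whose length differs from REF is treated as an indel.

```python
def classify_record(record):
    """Label a VCF record as 'reference', 'indel', 'snp' or 'mnp' (simplified)."""
    fields = record.split()
    ref, alt = fields[3], fields[4]
    if alt == ".":
        return "reference"  # site emitted without any ALT allele
    alt_alleles = alt.split(",")
    if any(len(allele) != len(ref) for allele in alt_alleles):
        return "indel"
    return "snp" if len(ref) == 1 else "mnp"


# The two records from the question
question_records = [
    "21 26039812 . G . 90.19 . GT:DP:GQ:PL:A:C:G:T:IR 0/0:20:60.20:0,60,807:0,0:0,0:14,6:0,0:8",
    "21 26039812 . GATAT G 384.15 . GT:DP:GQ:PL:A:C:G:T:IR 1/1:20:27.09:426,27,0:0,0:0,0:14,6:0,0:8",
]

for record in question_records:
    fields = record.split()
    print(fields[0], fields[1], classify_record(record))
```

Grouping these labels by position gives a quick way to list sites where a reference-site record and an indel record share the same coordinate.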
