GATK SelectVariants considers spanning/overlapping deletions as SNPs
3.6 years ago
cocchi.e89 ▴ 290

Quick question: I split a multiallelic VCF file with bcftools:

bcftools norm -m -any <IN.vcf> -Ov > <OUT.vcf>

and then separated the SNPs from the indels with GATK SelectVariants:

gatk SelectVariants \
 -R <REFERENCE.fasta> \
 -V <OUT.vcf> \
 --select-type-to-include SNP \
 -O <OUT.SNP.vcf>

But I noticed that this SNP-only VCF includes spanning/overlapping deletions (* allele) as SNPs. For example:

chr1    10443   .   C   *   54.40   VQSRTrancheSNP99.90to100.00 AC=1;AF=0.038;AN=26;ANN=T|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||1567|1||sequence_alteration|HGNC|HGNC:37102||||chr1:g.10443C>T,T|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|||||||||||1426|1||sequence_alteration|HGNC|HGNC:37102|YES|||chr1:g.10443C>T,T|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene|||||||||||3961|-1||sequence_alteration|HGNC|HGNC:38034|YES|||chr1:g.10443C>T;BaseQRankSum=1.83;DP=336;ExcessHet=0.202;FS=3.31;InbreedingCoeff=0.4448;MLEAC=2;MLEAF=0.077;MQ=30.71;MQRankSum=0;PG=0,8,19;QD=9.07;ReadPosRankSum=0.842;SOR=0.105;VQSLOD=-7.763;culprit=MQ    GT:AD:DP:FT:GQ:PL:PP    0/0:27,0:27:PASS:32:0,24,360:0,32,379   0/0:20,0:20:lowGQ:8:0,0,161:0,8,180 0/0:29,0:29:lowGQ:8:0,0,654:0,8,673 0/0:24,0:24:lowGQ:8:0,0,458:0,8,477 0/0:12,0:12:PASS:35:0,27,405:0,35,424   1/0:2,2:6:PASS:55:136,65,63:118,55,64   0/0:22,0:22:PASS:59:0,51,765:0,59,784   0/0:43,0:43:lowGQ:8:0,0,653:0,8,672 0/0:42,0:42:lowGQ:8:0,0,810:0,8,829 0/0:32,0:32:lowGQ:8:0,0,410:0,8,429 0/0:36,0:36:PASS:38:0,30,846:0,38,865   0/0:28,0:28:PASS:59:0,51,765:0,59,784   0/0:15,0:15:lowGQ:8:0,0,265:0,8,284

I think this is incorrect, aren't those supposed to be DEL? Or am I wrong?

Thank you in advance for any help!

SelectVariants SNP gatk INDEL • 1.6k views
3.6 years ago

I think this is incorrect, aren't those supposed to be DEL? Or am I wrong?

It's not an indel, it's IN an indel (!). It is a locally haploid region with a variant (you removed the other ALT allele from this record with norm), but there should be a variant with a larger deletion upstream of "chr1 10443".
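To check whether such an upstream deletion really exists, something like the following could be used. It is only a minimal sketch: `spanning_deletions` is a hypothetical helper name, it assumes multiallelic sites have already been split (one ALT per record), and it treats any record whose REF is longer than its ALT as a deletion covering POS+1 through POS+len(REF)-1. The demo input reuses the coordinates from the trio example further down this thread, with a placeholder "." ID column added.

```shell
# Hypothetical helper: list deletion records that span a given position.
# Usage: spanning_deletions CHROM POS < file.vcf
# Assumes one ALT per record (e.g. after bcftools norm -m -any).
spanning_deletions() {
    awk -F'\t' -v chrom="$1" -v pos="$2" '
        /^#/ { next }
        # Deletion: REF longer than ALT; deleted bases are POS+1 .. POS+len(REF)-1.
        $1 == chrom && $5 != "*" && length($4) > length($5) &&
            $2 < pos && pos <= $2 + length($4) - 1
    '
}

# Demo on four illustrative records (ID column filled with "." as a placeholder).
printf 'chr1\t154590147\t.\tCCG\tC\nchr1\t154590148\t.\tCG\tC\nchr1\t154590149\t.\tG\t*\nchr1\t154590149\t.\tG\tC\n' |
    spanning_deletions chr1 154590149
```

On the demo input this prints the CCG>C and CG>C records, since both delete the base at 154590149. It does not look at genotypes, so it cannot tell whether the deletion sits on the same haplotype as the * allele.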


Thanks so much. So can I consider a haploid region as a SNP?


Well, that variant "chr1 10443 . C *" is meaningless without the associated ALT allele. It should be discarded.
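If you do want to discard them, one way is to filter on the ALT column directly. This is a rough sketch rather than an official GATK or bcftools recipe: `drop_spanning_deletions` is a hypothetical helper name, and it assumes the file has already been split so that ALT (column 5) holds a single allele per record.

```shell
# Hypothetical helper: keep header lines and every record whose ALT is not "*".
# Assumes one ALT per record (multiallelic sites already split).
drop_spanning_deletions() {
    awk -F'\t' '/^#/ || $5 != "*"'
}

# Demo on a made-up header line plus two shortened records.
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t10443\t.\tC\t*\nchr1\t10450\t.\tG\tA\n' |
    drop_spanning_deletions

# Possible use on a real file (file names are placeholders):
#   drop_spanning_deletions < OUT.SNP.vcf > OUT.SNP.nostar.vcf
```

The demo keeps the header and the G>A record and drops the C>* record.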


If norm is removing the original indel, doesn't that suggest that the variant remaining after norm is a 1 bp deletion that is now mislabeled as C -> *? I'm trying to understand how an indel would be deleted by norm.


I am having a similar issue: in a phased trio VCF I find variants denoted by * but without an upstream deletion on the same phased allele. I want to get rid of these SNPs, but I am not sure how.

Here is an example. In the trio VCF - so 2 parents and the patient - I find:

#CHROM  POS      REF     ALT
chr1    154590147  CCG     C
chr1    154590148  CG      C
chr1    154590149  G       *
chr1    154590149  G       C

and then, when I just extract the patient genotypes using bcftools query:

#CHROM  POS        REF     ALT     GT
chr1    154590148  CG      C       0|1
chr1    154590149  G       *       1|0
chr1    154590149  G       C       0|1

See my question: Removing / Excluding / Collapsing Overlapping Indels
