3.6 years ago
cocchi.e89
Quick question: I split a multiallelic VCF file with bcftools:
bcftools norm -m -any <IN.vcf> -Ov > <OUT.vcf>
and then separated SNPs from INDELs with GATK SelectVariants:
gatk SelectVariants \
-R <REFERENCE.fasta> \
-V <OUT.vcf> \
--select-type-to-include SNP \
-O <OUT.SNP.vcf>
But I noticed that this SNP-only VCF includes spanning/overlapping deletions (* allele) as SNPs. For example:
chr1 10443 . C * 54.40 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.038;AN=26;ANN=T|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||1567|1||sequence_alteration|HGNC|HGNC:37102||||chr1:g.10443C>T,T|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|||||||||||1426|1||sequence_alteration|HGNC|HGNC:37102|YES|||chr1:g.10443C>T,T|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene|||||||||||3961|-1||sequence_alteration|HGNC|HGNC:38034|YES|||chr1:g.10443C>T;BaseQRankSum=1.83;DP=336;ExcessHet=0.202;FS=3.31;InbreedingCoeff=0.4448;MLEAC=2;MLEAF=0.077;MQ=30.71;MQRankSum=0;PG=0,8,19;QD=9.07;ReadPosRankSum=0.842;SOR=0.105;VQSLOD=-7.763;culprit=MQ GT:AD:DP:FT:GQ:PL:PP 0/0:27,0:27:PASS:32:0,24,360:0,32,379 0/0:20,0:20:lowGQ:8:0,0,161:0,8,180 0/0:29,0:29:lowGQ:8:0,0,654:0,8,673 0/0:24,0:24:lowGQ:8:0,0,458:0,8,477 0/0:12,0:12:PASS:35:0,27,405:0,35,424 1/0:2,2:6:PASS:55:136,65,63:118,55,64 0/0:22,0:22:PASS:59:0,51,765:0,59,784 0/0:43,0:43:lowGQ:8:0,0,653:0,8,672 0/0:42,0:42:lowGQ:8:0,0,810:0,8,829 0/0:32,0:32:lowGQ:8:0,0,410:0,8,429 0/0:36,0:36:PASS:38:0,30,846:0,38,865 0/0:28,0:28:PASS:59:0,51,765:0,59,784 0/0:15,0:15:lowGQ:8:0,0,265:0,8,284
I think this is incorrect: aren't those supposed to be DEL? Or am I wrong?
Thank you in advance for any help!
Thanks so much. So can I consider a haploid region as a SNP?
Well, that variant
chr1 10443 . C *
is meaningless without the associated ALT. It should be discarded.

If norm is removing the original indel, doesn't that suggest that the variant remaining after norm is a 1 bp deletion that is now mislabeled as C -> *? I am trying to understand how an indel would be deleted by norm.

I am having a similar issue, where I find variants from a phased trio VCF denoted by * but without an upstream deletion on that same phased allele. I want to get rid of these SNPs but am not sure how.

Here is an example. In the trio VCF (so 2 parents and the patient) I find:
and then, when I just extract the patient genotypes using bcftools query:
See my question: Removing / Excluding / Collapsing Overlapping Indels
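
For anyone who just wants to discard the C -> * records as suggested above, here is a minimal sketch using plain awk. It assumes the VCF is uncompressed, tab-delimited, and already split so that each record has a single ALT. The function name drop_star_alt and the output file name are hypothetical, not part of bcftools or GATK.

```shell
# Hypothetical helper (not a bcftools or GATK feature):
# drop VCF records whose ALT column (field 5) is exactly "*".
# Header lines (starting with "#") are passed through unchanged.
# Assumes tab-delimited input with one ALT allele per record.
drop_star_alt() {
    awk -F'\t' '/^#/ || $5 != "*"' "$@"
}

# Usage (file names are placeholders):
# drop_star_alt OUT.SNP.vcf > OUT.SNP.nostar.vcf
```

Note that this only removes records where * is the sole ALT, which is what you get after bcftools norm -m -any. On an unsplit multiallelic file, a site like C -> T,* would be kept as is.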