Hello,
I am trying to do something very simple, but running into confusing behaviour. I have a VCF file of multiple samples and want to remove all indels so that I can generate sequences with identical coordinates with bcftools consensus. I removed indels by specifying
bcftools view --include 'TYPE="snp" || TYPE="ref"' -O z -o newfile.vcf.gz filename.vcf.gz
I expected this to keep only SNPs or invariant sites. However, there are still some indels left in my file, for example, this position still has an indel, which breaks my downstream analyses:
chr_24 805548 . A . 14168.2 . DP=1074;MQ0F=0;AN=102;DP4=550,524,0,0;MQ=59;F_MISSING=0.271429 GT:DP:SP:AD 0/0:7:0:7 0/0:10:0:10 ./.:1:0:1 ./.:1:0:1 ./.:1:0:1 ./.:2:0:2 ./.:1:0:1 0/0:8:0:8 0/0:11:0:11 0/0:29:0:29 0/0:8:0:8 ./.:0:0:0 0/0:57:0:57 0/0:14:0:14 0/0:7:0:7 0/0:6:0:6 0/0:38:0:38 0/0:21:0:21 0/0:9:0:9 0/0:8:0:8 0/0:17:0:17 ./.:2:0:2 0/0:35:0:35 0/0:23:0:23 0/0:13:0:13 ./.:4:0:4 0/0:10:0:10 ./.:3:0:3 0/0:5:0:5 0/0:53:0:53 ./.:3:0:3 ./.:2:0:2 0/0:33:0:33 0/0:48:0:48 0/0:37:0:37 0/0:42:0:42 0/0:35:0:35 0/0:26:0:26 0/0:28:0:28 0/0:40:0:40 ./.:3:0:3 0/0:7:0:7 0/0:27:0:27 0/0:9:0:9 0/0:11:0:11 0/0:10:0:10 ./.:1:0:1 0/0:13:0:13 0/0:6:0:6 0/0:10:0:10 0/0:9:0:9
./.:2:0:2 ./.:1:0:1 ./.:0:0:0 0/0:32:0:32 0/0:12:0:12 0/0:9:0:9 0/0:8:0:8 ./.:2:0:2 0/0:6:0:6 ./.:4:0:4 0/0:7:0:7 0/0:15:0:15 0/0:35:0:35 0/0:29:0:29 0/0:7:0:7 0/0:24:0:24 0/0:10:0:10 ./.:4:0:4 0/0:11:0:11
chr_24 805548 . ATCT . 35.6358 . INDEL;IDV=21;IMF=1;DP=1074;VDB=2.39482e-42;SGB=133.525;RPBZ=2.03077;MQBZ=-3.57201;MQSBZ=-1.77777;BQBZ=-1.89605;SCBZ=-1.51717;MQ0F=0;AN=100;DP4=512,464,36,51;MQ=59;F_MISSING=0.285714 GT:DP:SP:AD 0/0:7:0:7 0/0:10:0:3 ./.:1:0:1 ./.:1:0:1 ./.:1:0:0 ./.:2:0:2 ./.:1:0:1 0/0:8:0:6
0/0:11:0:11 0/0:29:0:29 0/0:8:0:8 ./.:0:0:0 0/0:57:0:57 0/0:14:16:9 0/0:7:0:7 0/0:6:0:6 0/0:38:0:38 0/0:21:0:0 0/0:9:0:9 0/0:8:3:4 0/0:17:0:17 ./.:2:0:2 0/0:35:0:35 0/0:21:0:8 0/0:13:0:13 ./.:4:0:4 ./.:4:0:4 ./.:3:0:3 0/0:5:0:5 0/0:53:0:53 ./.:3:0:3 ./.:2:0:2 0/0:33:0:33 0/0:48:0:48 0/0:37:0:37 0/0:42:0:42 0/0:35:0:35 0/0:26:0:26 0/0:28:0:28 0/0:40:0:40 ./.:3:0:3 0/0:7:0:7 0/0:26:6:10 0/0:9:0:9 0/0:11:0:11 0/0:10:0:10 ./.:1:0:1 0/0:13:0:13 0/0:6:0:6 0/0:10:0:10 0/0:9:0:9 ./.:2:0:2 ./.:1:0:1 ./.:0:0:0 0/0:32:0:32 0/0:12:5:7 0/0:9:0:9 0/0:8:0:8 ./.:2:0:2 0/0:6:0:6 ./.:4:0:4 0/0:7:0:7 0/0:15:0:15 0/0:35:0:35 0/0:29:0:29 0/0:7:0:7 0/0:24:0:13
0/0:9:0:7 ./.:4:0:4 0/0:11:0:11
Testing TYPE="snp" and TYPE="ref" separately, it seems that it is kept via TYPE="ref". Does anyone know why this indel made it through the --include 'TYPE="ref"' filter, or what I can change to stop it from making it through the filter?
Thank you!
Thank you! Strange that the formatting is not proper, I generated the VCF with bcftools and haven't modified it with any other program.
so i may be wrong on that point.
I'm finding the same issue in my data. If Pierre is right, this might be a bug in the program. I've posted an 'issue' request on the developer's GitHub: https://github.com/samtools/bcftools/issues/2299
Maybe it is related with indels left-alignment? I mean the "." in your alt alleles.