I have a multi-sample VCF file with various samples of a bacterial genome, in the following format:
NC_num 20 . A T 66 . DP=1850;VDB=0.015032;SGB=16.4938;RPB=0.0719749;MQB=0.782089;MQSB=0.998951;BQB=0.976407;MQ0F=0;AC=1;AN=32;DP4=1599,229,6,0;MQ=59 GT:PL 0:0,255 0:0,255 .:0,0 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 1:126,17 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,63 0:0,255 0:0,255 0:0,255
NC_num 5232949 . C G,T 999 . DP=3540;VDB=0.267565;SGB=410.948;RPB=0.979846;MQB=0.999905;MQSB=0.99953;BQB=0.999963;MQ0F=0;AC=24,5;AN=33;DP4=268,289,1378,1322;MQ=59 GT:PL 1:255,0,255 1:255,0,255 1:37,0,37 2:255,255,0 1:255,0,255 1:255,0,255 1:255,0,255 1:255,0,255 0:0,255,255 1:255,255,255 1:255,0,255 1:255,0,255 1:255,0,255 1:74,0,74 1:255,0,255 1:255,0,255 1:255,0,255 0:0,255,255 1:255,0,255 2:255,255,0 1:255,0,255 2:255,255,0 2:255,255,0 1:255,0,255 1:255,0,255 0:0,255,255 1:255,0,255 2:255,255,0 1:255,0,255 1:150,0,150 1:255,0,255 1:255,0,255 0:0,255,255
NC_num 5233099 . C T 212 . DP=3744;VDB=0.000870234;SGB=21.4718;RPB=0.848604;MQB=0.000274995;MQSB=0.995943;BQB=0.953967;MQ0F=0;AC=1;AN=33;DP4=1869,1811,8,14;MQ=59 GT:PL 0:0,255 0:0,255 0:0,38 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 1:255,0 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255
I need to concatenate only the variant sites detected in the VCF file, per bacterial sample, into FASTA format (simply stitching together all the common SNPs) so that I can run an MSA downstream. I am stuck at concatenating the SNPs per sample. For example, in the first record above (position 20), REF=A and ALT=T, so some samples retain the A and others may carry the T. Since the organism is haploid, GT=0 means that the sample has an A (same as the reference) and GT=1 means that the sample has a T. Similarly, PL 0,255 means that the REF allele is more likely, and PL 126,17 means that the ALT allele is more likely. Finally, .:0,0 means that no allele was detected at this position in that particular sample (AN=32, i.e. only 32 of the 33 samples have an allele call).
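The interpretation above can be written down as a small Python sketch. It is only an illustration under stated assumptions: the function name `decode_haploid_record` and the use of `N` for a missing call are hypothetical choices, GT is assumed to be the first FORMAT key (as in `GT:PL`), and the sample list is a shortened version of the first record rather than all 33 columns.

```python
def decode_haploid_record(ref, alt, sample_fields, missing="N"):
    """Translate haploid GT codes of one VCF record into bases.

    Hypothetical helper: GT index 0 is REF, 1..n are the ALT alleles,
    and any non-numeric GT (such as ".") is reported as `missing`.
    """
    alleles = [ref] + alt.split(",")
    calls = []
    for field in sample_fields:
        gt = field.split(":")[0]  # assumes GT comes first, as in GT:PL
        calls.append(alleles[int(gt)] if gt.isdigit() else missing)
    return calls


# Shortened version of the position-20 record (REF=A, ALT=T).
calls = decode_haploid_record("A", "T", ["0:0,255", "0:0,255", ".:0,0", "1:126,17"])
print(calls)  # ['A', 'A', 'N', 'T']
```

The same lookup handles multi-allelic sites such as `G,T`, where GT=2 selects the second ALT allele.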
To put it simply, of all the samples, only one has a T, one has no allele detected, and the rest have an A. Am I correct? Is there a tool that does this per-sample concatenation of only these SNPs into FASTA format? (I don't want to incorporate these SNPs into a reference sequence.) The output I am after would look like:
>Sample1
ACC
>Sample2
AGT
>SampleN
TTC
etc...
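For reference, the kind of per-sample concatenation described here can be sketched in plain Python. This is a minimal sketch, not an existing tool: the function name `vcf_to_snp_fasta` is hypothetical, it assumes a tab-separated VCF with a `#CHROM` header line carrying the sample names, haploid GT values, and it writes `N` where a sample has no call (a gap character could be used instead). Indels and other non-SNP records are skipped.

```python
def vcf_to_snp_fasta(vcf_lines, missing="N"):
    """Concatenate SNP alleles per sample and return FASTA text.

    Hypothetical helper; assumes tab-separated VCF lines and haploid GT.
    """
    samples = []
    seqs = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            seqs = {name: [] for name in samples}
            continue
        alleles = [fields[3]] + fields[4].split(",")
        # Keep only sites where REF and every ALT are single bases (SNPs).
        if any(len(a) != 1 or a not in "ACGT" for a in alleles):
            continue
        gt_index = fields[8].split(":").index("GT")
        for name, sample_field in zip(samples, fields[9:]):
            gt = sample_field.split(":")[gt_index]
            # A non-numeric GT (e.g. ".") is treated as a missing call.
            seqs[name].append(alleles[int(gt)] if gt.isdigit() else missing)
    return "".join(f">{name}\n{''.join(bases)}\n" for name, bases in seqs.items())


# Toy input: two SNP records modelled on the ones above plus one indel.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "NC_num\t20\t.\tA\tT\t66\t.\tDP=1850\tGT:PL\t0:0,255\t.:0,0\t1:126,17",
    "NC_num\t600\t.\tAT\tA\t50\t.\tDP=10\tGT:PL\t0:0,255\t1:255,0\t0:0,255",
    "NC_num\t5232949\t.\tC\tG,T\t999\t.\tDP=3540\tGT:PL\t1:255,0,255\t2:255,255,0\t0:0,255,255",
]
fasta = vcf_to_snp_fasta(example)
print(fasta, end="")
```

On a real file the lines could be passed in with `open("calls.vcf")` (a placeholder filename). Every sequence has the same length because each retained site contributes exactly one character per sample, so the output is already column-aligned for downstream use.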