Concatenating alleles at detected sites to form fasta sequences from vcf files
7.5 years ago
Mohak ▴ 20

I have a multi-sample VCF file covering several samples of a bacterial genome, in the following format:

NC_num     20      .       A       T       66      .       DP=1850;VDB=0.015032;SGB=16.4938;RPB=0.0719749;MQB=0.782089;MQSB=0.998951;BQB=0.976407;MQ0F=0;AC=1;AN=32;DP4=1599,229,6,0;MQ=59 GT:PL   0:0,255 0:0,255 .:0,0   0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 1:126,17 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,63  0:0,255 0:0,255 0:0,255

NC_num  5232949 .   C   G,T 999 .   DP=3540;VDB=0.267565;SGB=410.948;RPB=0.979846;MQB=0.999905;MQSB=0.99953;BQB=0.999963;MQ0F=0;AC=24,5;AN=33;DP4=268,289,1378,1322;MQ=59   GT:PL   1:255,0,255 1:255,0,255 1:37,0,37   2:255,255,0 1:255,0,255 1:255,0,255 1:255,0,255 1:255,0,255 0:0,255,255 1:255,255,255   1:255,0,255 1:255,0,255 1:255,0,255 1:74,0,74   1:255,0,255 1:255,0,255 1:255,0,255 0:0,255,255 1:255,0,255 2:255,255,0 1:255,0,255 2:255,255,0 2:255,255,0 1:255,0,255 1:255,0,255 0:0,255,255 1:255,0,255 2:255,255,0 1:255,0,255 1:150,0,150 1:255,0,255 1:255,0,255 0:0,255,255

NC_num  5233099 .   C   T   212 .   DP=3744;VDB=0.000870234;SGB=21.4718;RPB=0.848604;MQB=0.000274995;MQSB=0.995943;BQB=0.953967;MQ0F=0;AC=1;AN=33;DP4=1869,1811,8,14;MQ=59  GT:PL   0:0,255 0:0,255 0:0,38  0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255  0:0,255    0:0,255 0:0,255 1:255,0 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255 0:0,255    0:0,255  0:0,255 0:0,255 0:0,255 0:0,255

For each bacterial sample, I need to concatenate only the variant sites detected in the VCF file into a FASTA sequence, i.e. stitch together the alleles at all the shared SNP sites, so that I can run an MSA downstream. I am stuck at concatenating the SNPs per sample. For example, in the first record above (position 20), REF=A and ALT=T, so some samples retain the A while others may carry a T. Since the organism is haploid, GT=0 means the sample has an A (same as the reference) and GT=1 means the sample has a T. Similarly, a PL of 0,255 means the REF allele is more likely, and a PL of 126,17 means the ALT allele is more likely. Lastly, .:0,0 means that no allele was detected at this position in that sample (AN=32, i.e. only 32 of the 33 samples have an allele call).
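To make this reading of the sample columns concrete, here is a minimal Python sketch of how a single haploid `GT:PL` field could be decoded. The function names are hypothetical, writing a missing call as `N` is my own assumption, and the code assumes GT is the first sub-field and holds a single allele index. PL values are Phred-scaled likelihoods, so the smallest value marks the most likely allele.

```python
def decode_haploid_call(ref, alt, sample_field, missing="N"):
    """Return the base implied by a haploid GT in a 'GT:PL' sample field."""
    alleles = [ref] + alt.split(",")   # index 0 = REF, 1.. = ALT alleles
    gt = sample_field.split(":")[0]    # assumes GT is the first sub-field
    if gt == ".":                      # no call in this sample
        return missing                 # 'N' is an assumed placeholder
    return alleles[int(gt)]


def most_likely_allele_index(sample_field):
    """PL is Phred-scaled, so the smallest value is the most likely allele."""
    pl = [int(x) for x in sample_field.split(":")[1].split(",")]
    return pl.index(min(pl))


print(decode_haploid_call("A", "T", "0:0,255"))    # A
print(decode_haploid_call("A", "T", "1:126,17"))   # T
print(decode_haploid_call("A", "T", ".:0,0"))      # N
print(most_likely_allele_index("1:126,17"))        # 1
```

For the multi-allelic record at position 5232949 (ALT=G,T), the same lookup gives G for GT=1 and T for GT=2.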

To put it simply, of all the samples only one has a T, one has no allele detected, and the rest have an A. Am I correct? Is there a tool that does this per-sample concatenation of only these SNPs into FASTA format? (I don't want to incorporate the SNPs into a reference sequence.) The output I am after looks like this:

Sample1 
ACC

Sample2
AGT

SampleN
TTC

etc...
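In case it helps to pin down what I mean, the concatenation could be sketched in plain Python roughly as below. This is only a sketch under several assumptions: the file is tab-delimited with a `#CHROM` header line naming the samples, every GT is a single haploid allele index, missing calls are written as `N`, and any record whose alleles are not single A/C/G/T bases (indels, symbolic alleles) is skipped. The function name `vcf_to_snp_fasta` and the sample names S1 to S3 are made up for illustration; the three demo records are trimmed copies of the ones above.

```python
import io

BASES = {"A", "C", "G", "T"}


def vcf_to_snp_fasta(vcf_handle, missing="N"):
    """Concatenate the called allele at every SNP record, per sample."""
    samples, seqs = [], {}
    for line in vcf_handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("##"):
            continue                              # blank or meta-information line
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]                  # sample names follow FORMAT
            seqs = {name: [] for name in samples}
            continue
        alleles = [fields[3]] + fields[4].split(",")   # REF, then ALT alleles
        if any(a not in BASES for a in alleles):
            continue                              # keep single-base SNPs only
        gt_index = fields[8].split(":").index("GT")
        for name, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            seqs[name].append(missing if gt == "." else alleles[int(gt)])
    return "".join(f">{n}\n{''.join(b)}\n" for n, b in seqs.items())


demo = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2", "S3"]),
    "\t".join(["NC_num", "20", ".", "A", "T", "66", ".", ".", "GT:PL",
               "0:0,255", ".:0,0", "1:126,17"]),
    "\t".join(["NC_num", "5232949", ".", "C", "G,T", "999", ".", ".", "GT:PL",
               "1:255,0,255", "2:255,255,0", "0:0,255,255"]),
    "\t".join(["NC_num", "5233099", ".", "C", "T", "212", ".", ".", "GT:PL",
               "0:0,255", "0:0,38", "1:255,0"]),
]) + "\n"

print(vcf_to_snp_fasta(io.StringIO(demo)), end="")
# >S1
# AGC
# >S2
# NTC
# >S3
# TCT
```

Every sample gets exactly one character per retained site, so the sequences are all the same length and already aligned column by column.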

SNP MSA VCF sequence alignment • 1.5k views