I have a VCF file created by running GATK on read files against a reference genome. The variants in the VCF file have 'locations'; these are positions on the reference genome. Sample lines include:
NC_002516.2 92915 . T A 1941.76 PASS AC=2;AF... GT:AD:DP:GQ:PL 1/1:0,80:81:99:1975,240,0
NC_002516.2 192617 . GA G 2562.66 PASS AC=2;AF=... GT:AD:DP:GQ:PL 1/1:0,64:64:99:2605,193,0
I also have a consensus sequence created by vcftools, which starts off as:
>NC_002516.2
TTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCAGCGAT
What I need, though, is the variant location on the consensus sequence. If, from the VCF file, '92915' is the first variant, then this is its location on the reference as well as on the consensus. However, there are indels further along, and these shift locations on the consensus forward or backward relative to the reference. So I need a tool to calculate each variant's location on the consensus.
(And then I will need to get annotation data for that region.) Any idea how I can get the variant locations on the consensus, please?
Actually, VCFtools is also giving an error, so I need to find another utility to create the consensus sequence. Much appreciated.
You mean because of insertions or deletions, correct? The other two comments look like they were thinking about SNPs only. Only alleles whose length differs from the reference allele can shift the locations that follow; if there are none, it doesn't matter. To calculate the shifts caused by indels, you first need to determine which allele was chosen for the consensus sequence, reference or non-reference. If it was the non-reference allele, shift every location to the right of it by length(non-ref) - length(ref). I would write a little R script for that.
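For what it's worth, the shift rule above could be sketched roughly as follows, in Python rather than R, purely as an illustration. The function names `parse_vcf_record` and `consensus_positions` are made up for this example. The sketch assumes a single-sample VCF for one contig, records sorted by position and not overlapping, only the first ALT allele being relevant, and the consensus carrying ALT whenever the genotype contains allele 1. Whether that last assumption holds depends on the tool that built the consensus, so check it before relying on the numbers.

```python
def parse_vcf_record(line):
    """Extract (pos, ref, alt, applied) from one single-sample VCF data line."""
    fields = line.split()
    pos = int(fields[1])                 # 1-based position on the reference
    ref = fields[3]
    alt = fields[4].split(",")[0]        # assumption: only the first ALT allele matters
    genotype = fields[9].split(":")[0]   # assumption: GT is the first FORMAT key
    # Assumption: the consensus carries ALT whenever the genotype contains allele 1.
    applied = "1" in genotype.replace("|", "/").split("/")
    return pos, ref, alt, applied


def consensus_positions(records):
    """Map reference positions to consensus positions.

    records: (pos, ref, alt, applied) tuples for one contig, sorted by pos
    and non-overlapping. Returns a list of (ref_pos, consensus_pos).
    """
    offset = 0  # net length change caused by variants to the left
    mapped = []
    for pos, ref, alt, applied in records:
        mapped.append((pos, pos + offset))
        if applied:
            offset += len(alt) - len(ref)
    return mapped


if __name__ == "__main__":
    # The first two records take POS/REF/ALT from the sample lines in the
    # question; the third is a made-up insertion to show a forward shift.
    records = [
        (92915, "T", "A", True),
        (192617, "GA", "G", True),
        (200000, "C", "CTT", True),
    ]
    for ref_pos, cons_pos in consensus_positions(records):
        print(ref_pos, cons_pos)
```

Each variant is reported at the consensus coordinate of its first base, so an indel only moves the variants that come after it, not itself.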
Thanks for the input. Yes, I think I will write my own script. I just thought there might be a way to do this by an existing tool, as it is possibly a common enough task.
Could be, but I don't know of any, sorry.
It's not clear to me what you are looking for. Can you edit your question to include sample inputs and the output you expect?
Presumably, the consensus sequence is simply the most likely nucleotide at each location of the reference. In that case, your variants will be the locations where the consensus does not match the reference, and the locations given in the VCF file will match the locations in the consensus. If this does not make sense, please edit your question to provide more detail.
You're right - but this is true only for SNPs. Indels will cause shifting.