Question

Vcf Locations In Consensus Sequence

1

12.4 years ago

Nupur G ▴ 30

I have a VCF file created by running GATK on read files against a reference genome. Each variant in the VCF file has a location, which is its position on the reference genome. Sample lines include:

NC_002516.2 92915 . T A 1941.76 PASS AC=2;AF... GT:AD:DP:GQ:PL 1/1:0,80:81:99:1975,240,0
NC_002516.2 192617 . GA G 2562.66 PASS AC=2;AF=... GT:AD:DP:GQ:PL 1/1:0,64:64:99:2605,193,0

I also have a consensus sequence created by vcftools, which starts off as:

>NC_002516.2
TTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCAGCGAT

What I need, though, is the variant location on the consensus sequence. If '92915' is the first variant in the VCF file, then it is at the same location on the reference and on the consensus. However, later variants include indels, which shift downstream locations on the consensus forward or backward. So I need a tool to calculate each variant's location on the consensus.

(And then I will need to get annotation data for that region.) Any idea how to get the variant locations on the consensus, please?

Actually, VCFtools is also giving an error, so I need to find another utility to create the consensus sequence. Much appreciated.

next-gen variant consensus vcf • 4.7k views

ADD COMMENT • link updated 11.1 years ago by Jorge Amigo 14k • written 12.4 years ago by Nupur G ▴ 30

1

You mean because of insertions or deletions? The other two comments look like they were thinking about SNPs only, correct? Only alleles of different lengths can shift a location; if the VCF contains none, it doesn't matter. To calculate the shifts caused by indels, first determine which allele was chosen for the consensus sequence, reference or non-reference. If it was the non-reference allele, shift each location to the right of it by length(non-ref) - length(ref). I would write a little R script for that.
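A minimal sketch of that shift calculation, written in Python rather than R. It assumes the variants are on one chromosome, sorted by position, and non-overlapping; the function name `consensus_positions`, the `applied` flag (whether the non-reference allele went into the consensus), and the last two records are made up for illustration. The first two records reuse POS, REF and ALT from the sample lines in the question.

```python
def consensus_positions(variants):
    """Map 1-based reference positions to 1-based consensus positions.

    variants: iterable of (pos, ref, alt, applied) tuples for a single
    chromosome, sorted by pos and non-overlapping (assumption).
    applied is True when the non-reference allele was used in the consensus.
    Returns a list of (reference_pos, consensus_pos) pairs.
    """
    offset = 0  # net length change introduced by all earlier variants
    mapping = []
    for pos, ref, alt, applied in variants:
        # POS in a VCF is the first base of REF, so only variants to the
        # left of this record can have moved it.
        mapping.append((pos, pos + offset))
        if applied:
            # length(non-ref) - length(ref): 0 for SNPs, negative for
            # deletions, positive for insertions.
            offset += len(alt) - len(ref)
    return mapping


records = [
    (92915, "T", "A", True),      # SNP from the question, no shift
    (192617, "GA", "G", True),    # 1 bp deletion from the question
    (200000, "C", "CTT", True),   # hypothetical 2 bp insertion
    (250000, "G", "T", True),     # hypothetical SNP further downstream
]

for ref_pos, cons_pos in consensus_positions(records):
    print(ref_pos, cons_pos)
```

With these records, the first two variants keep their reference positions, the third moves one base to the left because of the deletion before it, and the fourth ends up one base to the right once the insertion is counted as well.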

ADD REPLY • link 12.4 years ago by Michael 55k

0

Thanks for the input. Yes, I think I will write my own script. I just thought there might be a way to do this by an existing tool, as it is possibly a common enough task.

ADD REPLY • link 12.4 years ago by Nupur G ▴ 30

0

Could be but I don't know any, sorry.

ADD REPLY • link 12.4 years ago by Michael 55k

0

It's not clear to me what you are looking for. Can you edit your question to include sample inputs and the expected output you are looking for?

ADD REPLY • link 12.4 years ago by Rm 8.3k

0

Presumably, the consensus sequence is simply the most likely nucleotide at each location of the reference. In that case, your variants will be the locations where the consensus does not match the reference. The locations given in the VCF file will match the locations in the consensus, since the consensus then uses the same coordinates as the reference. If this does not make sense, please edit your question to provide more detail.

ADD REPLY • link 12.4 years ago by Sean Davis 27k

0

You're right - but this is true only for SNPs. Indels will cause shifting.

ADD REPLY • link 12.4 years ago by Nupur G ▴ 30

score 0 · Answer 1 · 2013-11-12

0

11.1 years ago

Jorge Amigo 14k

This question remains unanswered, but creating a FASTA sequence from a VCF variants file has already been covered in several places, such as "New Fasta Sequence From Reference Fasta And Variant Calls File?" or "Introducing Known Mutations (From A Vcf) Into A Fasta File".
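For orientation only, here is a rough Python sketch of what building such a sequence involves. It is not the method from those threads or from any particular tool. It assumes a single chromosome held in memory as a string, with variants given as sorted, non-overlapping (pos, ref, alt) tuples where the non-reference allele is always applied; `apply_variants` and the toy data are invented for illustration.

```python
def apply_variants(reference, variants):
    """Return a consensus string with ALT substituted for REF at each site.

    reference: the reference sequence of one chromosome, as a string.
    variants: list of (pos, ref, alt) tuples with 1-based pos, sorted by
    pos and non-overlapping (assumption).
    """
    pieces = []
    cursor = 0  # 0-based index of the next reference base not yet copied
    for pos, ref, alt in variants:
        start = pos - 1
        if start < cursor:
            raise ValueError(f"variant at {pos} overlaps the previous one")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:start])  # unchanged stretch
        pieces.append(alt)                      # substituted allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # tail after the last variant
    return "".join(pieces)


# Toy data (hypothetical): one SNP, one 1 bp deletion, one 2 bp insertion.
toy_reference = "ACGTGACCT"
toy_variants = [(2, "C", "T"), (5, "GA", "G"), (8, "C", "CTT")]
print(apply_variants(toy_reference, toy_variants))
```

The REF check is there because a mismatch usually means the VCF and the FASTA come from different reference versions, which would silently corrupt every downstream position.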

ADD COMMENT • link 11.1 years ago by Jorge Amigo 14k