Vcf Locations In Consensus Sequence
12.4 years ago
Nupur G ▴ 30

I have a VCF file created by running GATK on read files against a reference genome. Each variant in the VCF file has a 'location', which is its position on the reference genome. Sample lines include:

NC_002516.2 92915 . T A 1941.76 PASS AC=2;AF... GT:AD:DP:GQ:PL 1/1:0,80:81:99:1975,240,0
NC_002516.2 192617 . GA G 2562.66 PASS AC=2;AF=... GT:AD:DP:GQ:PL 1/1:0,64:64:99:2605,193,0

I also have a consensus sequence created by vcftools, which starts off as:

>NC_002516.2
TTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCAGCGAT

What I need, though, is each variant's location on the consensus sequence. If '92915' is the first variant in the VCF file, then its location is the same on the reference and on the consensus. Subsequent variants include indels, however, and these shift locations on the consensus forward or backward. So I need a tool to calculate the variant locations on the consensus.

(And then I will need to get annotation data for that region.) Any idea how this can be done, please, i.e. how to get variant locations on the consensus?

Actually, VCFtools is also giving an error, so I need to find another utility to create the consensus sequence. Much appreciated.

next-gen variant consensus vcf • 4.7k views

You mean because of insertions or deletions? The other two comments look like they were thinking about SNPs only, correct? Only alleles of different lengths can shift the start location; if the VCF contains none, it doesn't matter. To calculate the shifts caused by indels, first determine which allele was chosen for the consensus sequence, reference or non-reference. If it is the non-reference allele, shift every location to the right of it by length(non-ref) - length(ref). I would write a little R script for that.
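For anyone who wants to try the shift rule, here is a minimal sketch in Python (not the R script mentioned above). It assumes every listed variant was actually applied to the consensus (for example, GT 1/1), has a single ALT allele, and does not overlap another variant. The function name `consensus_positions` and the last two variants are hypothetical, added only to show a visible shift; the first two come from the question.

```python
from typing import List, Tuple

# (1-based reference POS, REF allele, ALT allele)
Variant = Tuple[int, str, str]


def consensus_positions(variants: List[Variant]) -> List[Tuple[int, int]]:
    """Map each variant's reference POS to its position on the consensus.

    Assumption: every variant given here was applied to the consensus,
    has one ALT allele, and does not overlap any other variant.
    """
    offset = 0  # net length change caused by variants to the left
    mapped = []
    for pos, ref, alt in sorted(variants):
        # The variant's own first base is shifted only by earlier variants.
        mapped.append((pos, pos + offset))
        # Everything to the right moves by len(ALT) - len(REF).
        offset += len(alt) - len(ref)
    return mapped


variants = [
    (92915, "T", "A"),     # from the question: SNP, no shift
    (192617, "GA", "G"),   # from the question: 1 bp deletion
    (200000, "C", "CTT"),  # hypothetical 2 bp insertion
    (250000, "G", "T"),    # hypothetical SNP
]

for ref_pos, cons_pos in consensus_positions(variants):
    print(ref_pos, cons_pos)
```

With these inputs the two variants from the question keep their positions, because a deletion only moves what lies to its right. The hypothetical variant at 200000 lands at 199999, and the one at 250000 lands at 250001. If some records were not applied to the consensus (for example, reference calls or filtered sites), they would need to be dropped before calling the function.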


Thanks for the input. Yes, I think I will write my own script. I just thought there might be a way to do this with an existing tool, as it is possibly a common enough task.


Could be but I don't know any, sorry.


It's not clear to me what you are looking for. Can you edit your question to include sample input and the expected output?


Presumably, the consensus sequence is simply the most likely nucleotide at each location of the reference. In that case, your variants will be the locations where the consensus does not match the reference, and the locations given in the VCF file will match the locations in the consensus. If this does not make sense, please edit your question to provide more detail.


You're right - but this is true only for SNPs. Indels will cause shifting.

11.1 years ago

This question remains unanswered, but creating a FASTA sequence from a VCF variants file has already been covered in several places, such as "New Fasta Sequence From Reference Fasta And Variant Calls File?" or "Introducing Known Mutations (From A Vcf) Into A Fasta File".
