i have a doubt. i have been analyzing a vcf file and i wanted to check for some of the SNPs if there was correspondence between REF nucleotide in the vcf and the effective nucleotide in the genome from which the SNP was called. In all cases the nucleotide in the specific location i checked was corresponding to the ALT nucleotide in the vcf. Maybe it's a noobie question, but id be very curious to know why (i was expecting a correspondence to REF nucleotide).
-> i started from the: genomeassembly file vcf file a list of interesting SNPs (contig and location) that we identified by SVS to have an high FST index:
The following cmd line scan the SNVs in a VCF and check the REF allele vs the
$ awk -F '\t' '/^#/ || !($4 ~ /^[ATGC]$/) {next;} {printf("%s,%s,%s\n",$1,$2,$4);}' mutations.vcf | while IFS="," read -a A; do echo -n "${A[2]} " && samtools faidx ref.fa "${A[0]}:${A[1]}-${A[1]}" | grep -v '^>' ; done
A A
A A
T T
T T
C C
A A
T T
A A
A A
G G
T T
T T
(...)
ok REF vcf == REF fasta , to my great relief, people won't have to retract their papers :-)
Well -- I don't know when the norm command or the --check-ref option were introduced but the Github history dates back until 2013. Don't you think your version is a little bit outdated?
So the output i got is this for every entry:
[W::vcf_parse] contig 'contig1' is not defined in the header. (Quick workaround: index the file with tabix.)
Your VCF file does not adhere to the VCF specifications. In a proper VCF, every template sequence (i.e., chromosome or contig) must be defined in the header, like exemplified in Section 1.1 of the specification (look for the line starting with ##contig=<ID=20).
Apart from your file not being proper VCF, this is a warning and my guess is that it can be ignored.
they used a reference genome but here in the vcf it says that "Reference allele is not known. The major allele was used as reference allele". So what that means? they did not set a reference genome but still they mapped again a reference and the vcf output in the column REF is imposed by the major allele??
I have no idea - but i guess yes: they somehow inferred variants and used the major allele as reference. Probably, you should contact the people that created the VCF file to ask for the methods...
Apart from that: without knowing the exact reference that was used (i.e., having the same FASTA file) I think bcftools norm will not be able to correctly annotated the reference allele.
yep, i just talked with the person. He created a new VCF from Tassel not defining the reference sequence e choosing the major allele as reference. Solved. Thanks a lot
how have you checked this ? what was your command line.
-> i started from the: genomeassembly file vcf file a list of interesting SNPs (contig and location) that we identified by SVS to have an high FST index:
-> i extracted the first 5 columns from the vcf for the SNPs
where 76832 is the position of the SNP stored in the vcf
The result i get is that the location i find in the contig is always correspondent to the nucleotide in the ALT column.