Which Column Of A VCF File Indicates The Reference Allele?
12.2 years ago

Hi,

I have a VCF-like file, shown below:

    chr     position        A1      A2
    16      85955663        G       A
    16      85955671        A       G
    16      85955948        A       G

The first column is the chromosome number, the second column is the position, the third column is A1 and the fourth column is A2.

I cannot tell whether A1 or A2 is the reference allele.

Is there a way to find this out? These are human SNPs.


Did these come from a microarray or from sequencing? (They could be in TOP/BOT nomenclature.)


These came from sequencing.

12.2 years ago

If you know the genome build, you could download it from UCSC or NCBI and compare a few alleles from A1 and A2 to the reference to answer your question. If the file you have follows the column order of the VCF specification, where REF comes immediately before ALT, then A1 is the REF allele:

    #CHROM  POS      ID         REF  ALT
    20      14370    rs6054257  G    A
    20      17330    .          T    A
    20      1110696  rs6040355  A    G,T
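
If it helps, the "compare a few alleles to the reference" step can be sketched as a small tally. This is only a sketch under assumptions: `classify`, `tally`, and the `ref_lookup` dictionary are hypothetical names, and the rows and reference bases below are toy placeholders, not real genome data. In practice `ref_lookup` would be filled from the reference FASTA for your build.

```python
from collections import Counter

def classify(ref_base, a1, a2):
    """Label one row by which allele column (if any) equals the reference base."""
    ref_base, a1, a2 = ref_base.upper(), a1.upper(), a2.upper()
    if a1 == ref_base and a2 == ref_base:
        return "both"
    if a1 == ref_base:
        return "A1"
    if a2 == ref_base:
        return "A2"
    return "neither"

def tally(rows, ref_lookup):
    """Count how often A1, A2, both, or neither match the reference."""
    counts = Counter()
    for chrom, pos, a1, a2 in rows:
        counts[classify(ref_lookup[(chrom, pos)], a1, a2)] += 1
    return counts

# Toy placeholder data (hypothetical, NOT real coordinates or reference bases).
rows = [
    ("toy", 1, "G", "A"),
    ("toy", 2, "A", "G"),
    ("toy", 3, "C", "C"),
    ("toy", 4, "G", "A"),
]
ref_lookup = {("toy", 1): "G", ("toy", 2): "G", ("toy", 3): "C", ("toy", 4): "T"}

counts = tally(rows, ref_lookup)
print(dict(counts))
```

If nearly every row lands in "A1", the first allele column is probably REF. A large "neither" count suggests a different build, a strand issue, or columns that are not REF/ALT at all.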

I tried doing this. However, for some of the positions, the reference nucleotide is in neither of the two columns. Does this mean that neither column is the reference?


This could mean that you have the wrong reference genome. How many of the other sites match the reference genome that you have?


Only about 15-20% of the sites match. I re-checked, and I am sure I am using the correct reference genome. This sequencing experiment was done using the hg19 build of the human genome, and I used the NCBI hg19 reference genome for the comparison. Could there be some heterozygous mutations in the reference used for sequencing?


Based on what you describe, are you sure that A1 and A2 are not the genotypes of individual samples? The VCF spec allows multiple individuals in one file. If you have more than one individual, there will be instances where the genotype of one of the samples will match the reference when the other sample is variant. There will also be positions where both samples share the same variant allele. Would this explain your A1 and A2 columns?


Out of curiosity, which genotyper produced this format?

12.2 years ago

It's very possible that your file does not describe the reference allele at that position, but rather gives the two alleles identified at that location. If you're looking at somatic mutations, most sites will be heterozygous, and your alleles will be the reference allele and the somatic mutation (say, G/A). In other cases, you might see a homozygous mutation (A/A), or in rare cases, two mutations at the same site (A/T).

You can use a reference fasta along with samtools faidx to quickly grab the reference allele at any given position, which may help you determine whether your first column is always the reference allele or not.
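
A minimal sketch of scripting that lookup, assuming `samtools` is on your PATH and the FASTA has been indexed. The helper names (`region_for`, `parse_faidx_output`, `fetch_ref_base`) and the path `hg19.fa` are hypothetical. The only external command used is `samtools faidx <fasta> <region>`.

```python
import subprocess

def region_for(chrom, pos):
    """Build a 1-based, single-base region string such as 'chr16:85955663-85955663'."""
    return f"{chrom}:{pos}-{pos}"

def parse_faidx_output(text):
    """Join the sequence lines of FASTA output, skipping '>' header lines."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line and not line.startswith(">")
    ).upper()

def fetch_ref_base(fasta_path, chrom, pos):
    """Run `samtools faidx <fasta> <region>` and return the base at that position."""
    result = subprocess.run(
        ["samtools", "faidx", fasta_path, region_for(chrom, pos)],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_faidx_output(result.stdout)

# Hypothetical usage (requires samtools and an indexed FASTA):
# base = fetch_ref_base("hg19.fa", "chr16", 85955663)
```

The chromosome name has to match the FASTA headers exactly, so a table that says `16` may need to be queried as `chr16` (or the other way round), depending on where the reference was downloaded from.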


Thanks for the reply. I am new to SNPs, so I had this very basic question. For position 85955663, my reference from NCBI suggests that it should be "T". However, neither of the two columns is "T" for that position, as can be seen in my question above. They are "G" and "A". Does this mean that neither of my columns is the reference?

12.2 years ago

It is possible, even if this is sequencing related, that the variants are in the TOP/BOT nomenclature typically reserved for Illumina GoldenGate genotyping chips.

http://www.illumina.com/documents/products/technotes/technote_topbot.pdf

TOP/BOT, although I still cannot wrap my head around what it is supposed to do, will someday be easily understood by aliens or future generations of humanoid-like organisms.
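
One rough way to test for a strand issue is to ask whether the complement of the reference base shows up among the two alleles. This is a sketch under a stated assumption: it treats the problem as a plain forward/reverse strand flip, which is not a full implementation of the TOP/BOT rules, and `strand_match` is a hypothetical helper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def strand_match(ref_base, a1, a2):
    """Report whether the reference base, or its complement, is among the two alleles."""
    ref = ref_base.upper()
    alleles = {a1.upper(), a2.upper()}
    if ref in alleles:
        return "forward"
    if COMPLEMENT.get(ref) in alleles:
        return "complement"
    return "none"

# The case reported in this thread: reference T, alleles G and A.
print(strand_match("T", "G", "A"))  # prints "complement"
```

A/T and C/G SNPs are ambiguous under this check, because the reference base and its complement are both among the alleles, so it is only informative in aggregate over many sites.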

12.2 years ago

In a real VCF, the columns are labeled "REF" and "ALT", so it's no mystery. I'd check with whoever prepared that table, because if NCBI says that the ref is neither of those letters, you might not be looking at the right reference, or the SNP calling was done against a slightly different reference.

While it's possible that the letters in your table are two alternate alleles, I don't think there are many points in the genome that are triallelic like that. You should not have a whole long list of such points.
