To answer question 1:
A better way to think of it is that each row in the VCF file is a specific variant location in the genome. For a SNP, there are only 4 possible nucleotides: A, C, T or G. One of these is the reference, so at most 3 can appear in the alternate allele column, and a single row may list more than one alternate allele. This would still be classed as the same variant. Sometimes the same location has multiple variant entries because they are identified by different databases.
Only these eight columns are mandatory in a VCF file:
#CHROM POS ID REF ALT QUAL FILTER INFO
A FORMAT column is required only when genotype data is present. The columns after FORMAT are non-fixed fields, as RamRS points out: one column per sample, commonly holding genotype data, where an individual (e.g. NA00001) is either homozygous for the reference allele (0|0), heterozygous (1|0 or 0|1), or homozygous for an alternate allele (1|1, or 2|2 when there is more than one alternate allele).
Here's an example of a 1000 Genomes variant, rs11725853, for a select few individuals (named HG#####):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109
4 175642699 rs11725853 G C,A 100 PASS GT 0|0 0|0 0|2 2|0 0|0 0|1 0|0 1|1 2|1 2|1 0|2 2|2
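To make the allele numbering concrete, here is a minimal Python sketch that turns the genotype indexes in the row above back into bases. It is only an illustration: `decode_genotypes` is a hypothetical helper, the row as posted carries no INFO value so the sketch fills in "." (the VCF missing value) as an assumption, and it splits on whitespace because that is how the row appears here (real VCF files are tab-delimited). Missing genotypes such as ./. are not handled.

```python
# Header and record from the example above. ASSUMPTION: the posted record has
# no INFO value, so "." (the VCF missing value) is inserted before FORMAT.
header = ("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
          "HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 "
          "HG00103 HG00105 HG00106 HG00107 HG00108 HG00109")
record = ("4 175642699 rs11725853 G C,A 100 PASS . GT "
          "0|0 0|0 0|2 2|0 0|0 0|1 0|0 1|1 2|1 2|1 0|2 2|2")


def decode_genotypes(header_line, record_line):
    """Map each sample name to a tuple of bases (hypothetical helper)."""
    columns = header_line.lstrip("#").split()
    row = dict(zip(columns, record_line.split()))
    # Allele index 0 is REF; 1, 2, ... index into the comma-separated ALT list.
    alleles = [row["REF"]] + row["ALT"].split(",")
    decoded = {}
    for sample in columns[9:]:  # sample columns start after FORMAT
        gt = row[sample].split(":")[0]  # GT comes first when it is present
        indexes = gt.replace("|", "/").split("/")
        decoded[sample] = tuple(alleles[int(i)] for i in indexes)
    return decoded


genotypes = decode_genotypes(header, record)
print(genotypes["HG00099"])  # ('G', 'A')
print(genotypes["HG00106"])  # ('A', 'C')
```

So HG00099 (0|2) carries the reference G and the second alternate allele A, while HG00106 (2|1) carries both alternate alleles.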
To answer question 2:
The VCF specifications that RamRS linked to describe the ID column as follows:
"ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant the rs number(s) should be used. No identifier should be present in more than one data record. If there is no identifier available, then the MISSING value should be used. (String, no white-space or semi-colons permitted, duplicate values not allowed.)"
In practice you can put almost anything you want in this field, including your own labels, as long as it contains no whitespace or semicolons. Sometimes more than one variant is known at a location, e.g. a dbSNP germline variant and a COSMIC somatic variant. Both identifiers would be listed here, separated by semicolons. These identifiers are not for the individuals; as we see above, individuals are listed as columns after the standard VCF header columns.
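Following the specification text quoted above, which describes a semicolon-separated list with a MISSING value, reading the ID field could look something like this rough sketch. `split_ids` is a hypothetical helper, and COSM0000 is a made-up placeholder rather than a real COSMIC identifier.

```python
def split_ids(id_field):
    """Return the identifiers in a VCF ID field (hypothetical helper).

    "." is the VCF missing value, meaning no identifier is available.
    """
    if id_field == ".":
        return []
    return id_field.split(";")


# rs11725853 comes from the example above; COSM0000 is a made-up placeholder.
print(split_ids("rs11725853;COSM0000"))
print(split_ids("."))
```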
Hello sanna.aizad,
These seem like assignment questions. Are they? Have you tried reading the VCF specifications?
Yes, I have read the VCF specs several times and I still couldn't figure these out.
Are these assignment questions though?
Also, can you please explain to me your understanding of what a variant is? This understanding is critical to answering your first question.
The VCF format can be understood by thinking of it as a 3D matrix: each row is a variant, each non-fixed field is a sample, and each variant-sample intersection (a "cell") is itself a small matrix describing the nature of the specific variant in the specific sample.
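One way to picture that 3D structure is as a nested lookup: variant, then sample, then FORMAT key. The Python below is a minimal sketch under that assumption; `build_matrix` is a hypothetical helper, and the header and record are made-up toy data, not taken from the specification or from any real file.

```python
def build_matrix(header_line, record_lines):
    """Build variant -> sample -> FORMAT key -> value (hypothetical helper)."""
    samples = header_line.lstrip("#").split("\t")[9:]
    matrix = {}
    for line in record_lines:
        fields = line.split("\t")
        variant = (fields[0], int(fields[1]))  # (CHROM, POS)
        keys = fields[8].split(":")            # FORMAT keys, e.g. GT and GQ
        matrix[variant] = {
            sample: dict(zip(keys, cell.split(":")))
            for sample, cell in zip(samples, fields[9:])
        }
    return matrix


# Made-up toy data: two samples, one variant, FORMAT of GT:GQ.
toy_header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B"
toy_records = ["1\t100\t.\tA\tG\t.\tPASS\t.\tGT:GQ\t0/1:40\t1/1:35"]

matrix = build_matrix(toy_header, toy_records)
print(matrix[("1", 100)]["SAMPLE_A"])  # {'GT': '0/1', 'GQ': '40'}
```

The three levels of the dictionary correspond to the three axes: the row, the sample column, and the keys named in FORMAT.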
This is an example from the specifications doc:
The non-fixed field here is NA00001 and the variant entry is for position 1:2827694. As you can see, the matrix at the intersection is as follows:
The GT and GQ part is obtained from the FORMAT entry, and is uniform across samples.
Thank you for this. I am trying to understand what a variant is, so I don't understand what you mean by assignment.
I have tried to represent the 3D matrix in 2D to understand it better. I have used the example from the VCF specs v4.1.
I have a feeling I may have gotten the Alts wrong. But here is what I have understood:
https://ibb.co/6R0bK3f
Looks about right. The confusion you have with "which of the two numbers is the ref allele" goes away once you remember that it's always the 0 that's the ref allele. Usually, in unphased VCF files [1], you'd see heterozygous genotypes as 0/1. However, phased VCF entries can show 1|0 (note the | pipe symbol as opposed to the / forward-slash). This means that the genotype for that individual is heterozygous, and that each allele has been assigned to a haplotype. Where the file follows a paternal|maternal ordering, the first allele (1) was derived from the father and the second allele (0) from the mother. Phasing adds this haplotype information to the zygosity.
[1] This could change for multi-allelic variants, and will change for non-diploid cells (see specs below for example). An unphased entry for a biallelic variant in a diploid organism is easier to understand, and other cases start adding layers of complexity.
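As a rough illustration of the separator and zygosity rules described here, a genotype string can be classified like this. `describe_genotype` is a hypothetical helper written for this post, not part of any VCF library, and it only looks at the GT string itself.

```python
def describe_genotype(gt):
    """Return (phased, zygosity) for a GT string (hypothetical helper)."""
    phased = "|" in gt  # "|" marks a phased call, "/" an unphased one
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        zygosity = "missing"
    elif len(alleles) == 1:
        zygosity = "haploid"
    elif len(set(alleles)) > 1:
        zygosity = "heterozygous"
    elif alleles[0] == "0":
        zygosity = "homozygous reference"  # every allele is index 0 (REF)
    else:
        zygosity = "homozygous alternate"
    return phased, zygosity


print(describe_genotype("0/1"))  # (False, 'heterozygous')
print(describe_genotype("1|0"))  # (True, 'heterozygous')
print(describe_genotype("2|2"))  # (True, 'homozygous alternate')
```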
From the specs doc:
By the way, see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here)