3.1 years ago
adam
I apologize in advance if this is a silly question - but I am trying to understand how two inherited variants of a gene are represented in typical whole genome sequencing formats (VCF, FASTA/Q).
Here is one example illustrating my confusion, using a SNP VCF from a WGS. Take TAS2R38: GRCh37.p13 puts the gene's reference location at chr7:g.141672431 - chr7:g.141673573.
Using bcftools, if I call:
bcftools view genome.filtered.snp.vcf.gz 7:141673345-141673345
#CHROM POS ID REF ALT QUAL FILTER INFO
7 141673345 . C G 1434.3 PASS
How could I view the SNP, if any, from the other copy of the gene? And which allele am I viewing when I do the above?
I am not sure I fully understand the question but are you asking how to understand the VCF format?
For this example gene, the VCF gives the reference base 'C' and the alternative base 'G'. This describes your two alleles, which are the same except at this one position.
Typically VCFs are generated by comparison to a single reference genome, so all variants/alleles are described relative to that genome: REF is the reference version, and ALT is the version found in the sample/data you compared against it.
Does that answer the question?
Thanks for your reply. Sort of. In this example, as you said, I have a nucleotide variant of G where the reference base shows C. However, as I understand it, this is only showing a SNP in 1 of the 2 inherited copies of the gene in a diploid organism. What about the other copy? Is there any way to know if the same SNP, or perhaps a different mutation, exists at the same location on the other copy of the gene? Do VCFs "collapse" the mutations from both copies of the gene into one? Hoping I am explaining this question more clearly.
Hi adam,
your VCF is missing some crucial information. What you have there is what you could call a "variant VCF": it describes which variant may exist at a given position. It is, however, missing the columns with the genotype.
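A quick way to check whether a VCF carries genotype columns is to look at its #CHROM header line. The sketch below assumes the standard layout of eight fixed columns, then FORMAT, then one column per sample; the function name and the sample name "adam" are made up for illustration.

```python
def sample_names(header_line):
    """Return the sample column names from a VCF #CHROM header line."""
    cols = header_line.lstrip("#").split()
    # Columns 1-8 are CHROM..INFO, column 9 is FORMAT, samples come after that
    return cols[9:] if len(cols) > 9 else []

# Header as shown in the question: no FORMAT column, no samples
print(sample_names("#CHROM POS ID REF ALT QUAL FILTER INFO"))  # []

# Hypothetical header for a VCF that does carry one sample
print(sample_names("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tadam"))  # ['adam']
```

An empty list means there are no genotype columns to read, which matches the output in the question.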
In a VCF with samples you would have a column per individual, in which the alleles are encoded. The reference allele (C in your example) is encoded with a 0, the alternative allele (G in your example) with a 1. A heterozygous individual (with one copy of the reference allele and one copy of the alternative allele) would be 0/1. An individual with two copies of the alternative allele would be 1/1.
Most often this genotype is based on counting the individual reads with either the C or G allele. Commonly a genome is sequenced to 30x coverage, meaning that every base is observed about 30 times on average. For a heterozygous variant you would then expect to see each allele about 15 times, although there is obviously some (Poisson) variation on that and you will get slight deviations from the ideal 50-50 ratio.
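To make the 0/1 encoding concrete, here is a minimal sketch that decodes the genotype from a single VCF record. The first eight columns are taken from the question; the FORMAT column and the sample column (GT:AD with the value 0/1:14,16) are invented for illustration, since the posted VCF has no sample columns. The function name is hypothetical too.

```python
# Hypothetical record: CHROM..INFO come from the question,
# FORMAT (GT:AD) and the sample value (0/1:14,16) are made up.
record = "7\t141673345\t.\tC\tG\t1434.3\tPASS\t.\tGT:AD\t0/1:14,16"

def decode_genotype(vcf_line):
    """Return the bases on the two copies and, if AD is present, reads per allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    alleles = [ref] + alt.split(",")  # index 0 = REF, 1 and up = ALT alleles
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    # GT separates the two alleles with "/" (unphased) or "|" (phased)
    gt = sample["GT"].replace("|", "/").split("/")
    bases = [alleles[int(i)] if i != "." else None for i in gt]
    depths = None
    if "AD" in sample:
        # AD lists read counts in the same order as the alleles
        depths = dict(zip(alleles, (int(x) for x in sample["AD"].split(","))))
    return bases, depths

bases, depths = decode_genotype(record)
print(bases)   # ['C', 'G']
print(depths)  # {'C': 14, 'G': 16}
```

With a genotype of 1/1 the same function would return ['G', 'G'], i.e. both copies carry the alternative allele.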
So yes, VCF "collapses" both copies of a gene/both alleles of a variant, but you can still figure out the status of both copies.
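To put rough numbers on the read-count variation mentioned above, the sketch below treats the depth as fixed at exactly 30 reads and assumes each read is equally likely to come from either copy (a simple binomial model, which is an assumption made here; real depth also varies from site to site).

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of seeing exactly k ALT reads out of n at a heterozygous site."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

depth = 30  # assumed fixed depth, for illustration only

prob_exact = binom_pmf(15, depth)
prob_10_to_20 = sum(binom_pmf(k, depth) for k in range(10, 21))

print(f"P(exactly 15 ALT reads) = {prob_exact:.3f}")
print(f"P(10 to 20 ALT reads)   = {prob_10_to_20:.3f}")
```

Under these assumptions an exact 15/15 split is the most likely single outcome but still the exception, so a heterozygous site showing, say, 13 and 17 reads is nothing unusual.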