Hello experts,
I would like to preface this post by saying that I'm a biochemical geneticist by trade and primarily work with biochemical pathways and biomarker detection. I have zero coding experience and my limited molecular genetic experience is exclusively in variant curation.
I have a keen interest in mitochondrial disorders, so when the opportunity arose to work on a project to help establish my facility as a center for WGS analysis, I jumped on it (without really thinking!). I have a VCF file from NextGENe V2.4.1 (I'm currently working with this in Excel) that does not contain mutations in the mitochondrial genome. I would like to manipulate the VCF file to include about a dozen or so common mitochondrial mutations to be annotated using Alissa 5.3.
Question: 1. Are there any resources available (I've found this: http://samtools.github.io/hts-specs/VCFv4.2.pdf) that can tell me what I'm looking at in the current VCF file and what it means? My biochemical brain only sees letters and numbers.... What data is critical to have to feed into Alissa? 2. Is Alissa the best platform to be using for mitochondrial genomes? If not, what are other suggestions?
I'm feeling very out of my depth here.
Thank you!
The VCF specification document is the go-to resource to understand VCF files. I can try and simplify it a little:
- Lines starting with `##` are header lines with meta-information. These lines describe the information contained in the VCF file. If you imagine a table with headers, this part would describe what the column names actually mean.
- The line starting with `#CHROM` is kind of like the table header. There are 8 fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. Column 9, called FORMAT, describes the format that columns 10 onward take. Columns 10 onward are one column per sample described in the VCF.
- ID is the identifier of a known variant at that location, or `.` if no existing variant matched to that location.
- FILTER values are described by the `##FILTER` lines in the meta-information.
- INFO holds key-value pairs (`key1=value1;key2=value2;...`). Check out the `##INFO` lines for a description of what each of the keys means.
- FORMAT values are separated by `:`. If FORMAT has 5 values separated by `:`, every subsequent column will have 5 values separated by `:`, and the ith value in this column will be the header for the ith value in subsequent columns.

The spec has examples that will help understand the format better. A quick summary would be:
An important FORMAT field is the `GT` field, which gives us the genotype of the change. Here, `0` is the REF allele, and other numbers are ALT alleles in the order listed in the ALT field. So, for a diploid organism, `0/0` is hom-ref; `0/1` is heterozygous and `1/1` is homozygous mutant.

I hope I haven't confused you more than the spec doc.
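
If it helps to see the layout above in action, here is a minimal Python sketch that splits one VCF data line into its fixed columns, its INFO keys, and its per-sample FORMAT values, then spells out what each `GT` means. The data line is modelled on the example record in the VCF 4.2 spec, and the function names `parse_record` and `describe_genotype` are made up for this illustration. It assumes a tab-separated line and is not a replacement for a proper VCF library.

```python
# Illustrative data line modelled on the example in the VCF 4.2 specification.
RECORD = (
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
    "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,."
)

FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_record(line):
    """Split one tab-separated VCF data line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))

    # INFO: key=value pairs separated by ';'. Keys without '=' are flags.
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info

    # FORMAT (column 9) names the ':'-separated values in each sample column.
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    record["SAMPLES"] = [
        dict(zip(format_keys, sample.split(":"))) for sample in fields[9:]
    ]
    return record


def describe_genotype(gt, ref, alts):
    """Turn a GT string such as '0/1' into words (hypothetical helper)."""
    alleles = [ref] + alts  # index 0 is REF, 1.. are ALT alleles in order
    calls = gt.replace("|", "/").split("/")  # '|' is phased, '/' unphased
    if "." in calls:
        return "missing"
    bases = "/".join(alleles[int(call)] for call in calls)
    if len(set(calls)) > 1:
        return "heterozygous (" + bases + ")"
    if calls[0] == "0":
        return "homozygous reference (" + bases + ")"
    return "homozygous alternate (" + bases + ")"


if __name__ == "__main__":
    rec = parse_record(RECORD)
    print(rec["CHROM"], rec["POS"], rec["ID"], rec["REF"], rec["ALT"])
    print(rec["INFO"])
    for sample in rec["SAMPLES"]:
        alts = rec["ALT"].split(",")
        print(sample["GT"], "->", describe_genotype(sample["GT"], rec["REF"], alts))
```

Note that this sketch treats phased (`|`) and unphased (`/`) genotypes the same way, which is fine for reading a file by eye but drops the phasing information.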