Hello everyone,
I have little to no prior knowledge of biology (let's say high school level), but I do have a strong machine learning background. A project I am involved in has to do with obtaining predictions from a dataset of tumor samples. One of the predictors we need is Variant Allele Frequency (VAF), so I downloaded one tumor dataset from the TCGA data portal to see how this calculation might be done. I understand what Variant Allele Frequency is in general, but I cannot work out how the calculation is done in practice.
The dataset has the following columns (it has many more in fact, but for my needs I just summarized all numerical columns):
=======================================================================
Statistic           N    Mean     St. Dev.   Min   Max
-----------------------------------------------------------------------
t_depth             2    84.000   36.770     58    110
t_ref_count         2    70.000   32.527     47    93
t_alt_count         2    13.500   3.536      11    16
n_depth             2    88.500   45.962     56    121
ALLELE_NUM          2    1.000    0.000      1     1
TRANSCRIPT_STRAND   2    0.000    1.414      -1    1
PICK                1    1.000               1     1
TSL                 2    1.000    0.000      1     1
MINIMISED           2    1.000    0.000      1     1
-----------------------------------------------------------------------
What I want is to add a column, say named vafs, where for each row (each tumor sample) the Variant Allele Frequency is calculated. From my (very basic) understanding, t_ref_count and t_alt_count are the columns needed to calculate the Variant Allele Frequency. Is that correct? Do I need to use other columns to perform the calculation? And how precisely is this calculation done?
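To make the question concrete, here is a minimal sketch of what I have in mind. It assumes the commonly used definition, VAF = alt reads / total reads at the site, i.e. t_alt_count / t_depth. Whether the denominator should instead be t_ref_count + t_alt_count is part of what I am unsure about (the two can differ when reads support other alleles). The function name add_vaf and the sample values are hypothetical and are not taken from the actual dataset.

```python
def add_vaf(rows):
    """Return copies of the rows with an extra 'vafs' key.

    Assumption: VAF = t_alt_count / t_depth. Rows with zero depth
    get None, since the ratio is undefined there.
    """
    out = []
    for row in rows:
        depth = row["t_depth"]
        vaf = row["t_alt_count"] / depth if depth > 0 else None
        out.append({**row, "vafs": vaf})
    return out


# Illustrative values only (not from the TCGA file).
samples = [
    {"t_depth": 100, "t_ref_count": 80, "t_alt_count": 20},
    {"t_depth": 50, "t_ref_count": 45, "t_alt_count": 5},
    {"t_depth": 0, "t_ref_count": 0, "t_alt_count": 0},
]

result = add_vaf(samples)
for row in result:
    print(row)
```

If the data were in a data frame rather than a list of dicts, the same ratio would presumably be a single column-wise division.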
As an aside, I am going to ask a field expert at some point (I am not going to do it all by myself, since I lack the knowledge), but I need to at least grasp how this can be obtained before going any further.
+1 for mentioning - "fraction of individuals with a mutation in a population." VAF could be an ambiguous term without proper context.