Question

Calculate Variant Allele Frequency in a TCGA dataset

7.9 years ago

mp85 ▴ 30

Hello everyone,

I have little to no prior knowledge of biology (let's say high school level), but I do have a strong machine learning background. A project I am involved in requires making predictions from a dataset of tumor samples. One of the predictors we need is Variant Allele Frequency (VAF), so I downloaded one tumor dataset from the TCGA data portal to see how this calculation might be done. I understand what Variant Allele Frequency is in general, but I cannot work out how the calculation is done in practice.

The dataset has the following columns (it has many more in fact, but for my needs I just summarized all numerical columns):

=======================================================================
Statistic         N      Mean         St. Dev.       Min        Max    
-----------------------------------------------------------------------
t_depth           2     84.000         36.770         58        110    
t_ref_count       2     70.000         32.527         47         93    
t_alt_count       2     13.500         3.536          11         16     
n_depth           2     88.500         45.962         56        121     
ALLELE_NUM        2     1.000          0.000          1          1     
TRANSCRIPT_STRAND 2     0.000          1.414          -1         1     
PICK              1     1.000                         1          1     
TSL               2     1.000          0.000          1          1     
MINIMISED         2     1.000          0.000          1          1     
-----------------------------------------------------------------------

What I want is to add a column, say named vafs, where the Variant Allele Frequency is calculated for each row (each tumor sample). From my (very basic) understanding, t_ref_count and t_alt_count are the columns needed to calculate the Variant Allele Frequency. Is that correct? Do I need other columns to perform the calculation? And how precisely is this calculation done?

As an aside, I am going to ask a field expert at some point (I am not going to do it all by myself, since I lack the knowledge), but I need to at least grasp how this can be obtained before going any further.

genome R • 6.0k views

updated 7.9 years ago by igor 13k • written 7.9 years ago by mp85 ▴ 30

score 5 · Accepted Answer · 2017-06-04

7.9 years ago

igor 13k

Yes, you're correct. VAF is t_alt_count / (t_ref_count + t_alt_count).
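As a minimal sketch of that formula in plain Python (the thread is tagged R, but the arithmetic is the same in any language): the function name `vaf` and the two example rows are illustrative assumptions. The counts are taken from the min/max values in the question's summary table, and pairing them into rows this way is a guess, not something the table confirms.

```python
def vaf(t_ref_count, t_alt_count):
    """Variant allele frequency: alt reads / (ref reads + alt reads)."""
    total = t_ref_count + t_alt_count
    if total == 0:
        # No reads support either allele, so the ratio is undefined.
        return None
    return t_alt_count / total


# Hypothetical rows; the pairing of ref and alt counts is an assumption.
rows = [
    {"t_ref_count": 47, "t_alt_count": 11},
    {"t_ref_count": 93, "t_alt_count": 16},
]

# Add the new "vafs" column to every row.
for row in rows:
    row["vafs"] = vaf(row["t_ref_count"], row["t_alt_count"])

for row in rows:
    print(row)
```

The zero-total guard is a design choice for the sketch; depending on the downstream model, such rows could instead be dropped or set to a missing value.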

When dealing with allele frequencies, also be careful about the context. The term can refer to the fraction of reads in a single sample (since you are dealing with cancer data, that is probably the case) or to the fraction of individuals with a mutation in a population.
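To make the two meanings concrete, here is a rough sketch that computes each one. The function names and all of the numbers are made up for illustration only; they do not come from the TCGA file.

```python
def read_fraction(alt_reads, ref_reads):
    """Per-sample meaning: share of reads at one site that carry the variant."""
    return alt_reads / (alt_reads + ref_reads)


def carrier_fraction(carriers, individuals):
    """Population meaning: share of individuals who carry the variant."""
    return carriers / individuals


# Hypothetical numbers, chosen only to show that the two quantities differ.
per_sample = read_fraction(alt_reads=20, ref_reads=80)       # one tumor sample
population = carrier_fraction(carriers=3, individuals=1000)  # a cohort

print(f"read fraction in one sample: {per_sample:.3f}")
print(f"carrier fraction in a population: {population:.3f}")
```

The first quantity is computed from read counts within a single sample, the second from counts of people, so the two should not be mixed in the same predictor column.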

+1 for mentioning - "fraction of individuals with a mutation in a population." VAF could be an ambiguous term without proper context.

7.9 years ago by poisonAlien ★ 3.2k