Calculate Variant Allele Frequency in a TCGA dataset
7.5 years ago
mp85 ▴ 10

Hello everyone,

I have little to no prior knowledge of biology (let's say high school level), but I do have a strong machine learning background. A project I am involved in requires making predictions from a dataset of tumor samples. One of the predictors we need is Variant Allele Frequency (VAF), so I downloaded one tumor dataset from the TCGA data portal to see how this calculation might be done. I understand what Variant Allele Frequency is in general, but I cannot work out how the calculation is done in practice.

The dataset has the following columns (it has many more in fact, but for my needs I just summarized all numerical columns):

=======================================================================
Statistic         N      Mean         St. Dev.       Min        Max    
-----------------------------------------------------------------------
t_depth           2     84.000         36.770         58        110    
t_ref_count       2     70.000         32.527         47         93    
t_alt_count       2     13.500         3.536          11         16     
n_depth           2     88.500         45.962         56        121     
ALLELE_NUM        2     1.000          0.000          1          1     
TRANSCRIPT_STRAND 2     0.000          1.414          -1         1     
PICK              1     1.000                         1          1     
TSL               2     1.000          0.000          1          1     
MINIMISED         2     1.000          0.000          1          1     
-----------------------------------------------------------------------

What I want is to add a column, say named vafs, where the Variant Allele Frequency is calculated for each row (each tumor sample). From my (very basic) understanding, t_ref_count and t_alt_count are the columns needed to calculate the Variant Allele Frequency. Is that correct? Do I need to use other columns to perform the calculation? And how precisely is this calculation done?

As an aside, I am going to ask a field expert at some point (I am not going to do it all by myself, since I lack the knowledge), but I need to at least grasp how this can be obtained before going any further.

genome R • 5.7k views
7.5 years ago
igor 13k

Yes, you're correct. VAF is t_alt_count / (t_ref_count + t_alt_count).
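
As a minimal sketch of that formula in Python (standard library only): the helper name `add_vaf`, the output key `vaf`, and the read counts below are illustrative assumptions and are not taken from the actual TCGA file. Only the column names `t_ref_count` and `t_alt_count` come from the question.

```python
def add_vaf(rows):
    """Return copies of the input rows with a 'vaf' entry added.

    Each row is a dict holding 't_ref_count' and 't_alt_count', the number of
    tumor reads supporting the reference and the alternate allele.
    """
    result = []
    for row in rows:
        ref = row["t_ref_count"]
        alt = row["t_alt_count"]
        total = ref + alt
        # Avoid dividing by zero when no reads cover the position.
        vaf = alt / total if total > 0 else None
        result.append({**row, "vaf": vaf})
    return result


# Hypothetical read counts, chosen only to illustrate the arithmetic.
rows = [
    {"t_ref_count": 30, "t_alt_count": 10},
    {"t_ref_count": 15, "t_alt_count": 15},
    {"t_ref_count": 0, "t_alt_count": 0},
]

for row in add_vaf(rows):
    print(row)
```

Rows with no supporting reads get `None` rather than raising an error. Whether to drop or keep such rows is a choice that depends on the downstream model.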

When dealing with allele frequencies, also be careful about the context. The term can refer to the fraction of reads in a single sample (since you are dealing with cancer data, that is probably the case) or to the fraction of individuals with a mutation in a population.

+1 for mentioning the "fraction of individuals with a mutation in a population" meaning. VAF can be an ambiguous term without proper context.
