Hello
I have only read about DNA sequencing and have never seen the actual results from a sequencing project. I'm wondering how heterozygotes and somatic mutations show up in sequencing results. This is my understanding of a sequencing project:
1) Extract DNA, typically from blood cells.
2) Make a clone library. There is a formula that works out how many clones you need to make sure all of the DNA of a heterozygous individual is represented in the library (by all of the DNA I mean both copies of each chromosome).
3) Sequence the clones.
The sequencing project has an overall coverage. On a genome basis, it means that, on average, each base has been sequenced a certain number of times (10X, 20X...). For a specific nucleotide, it is the number of reads that cover that nucleotide.
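The clone-number formula mentioned above is commonly given as the Clarke-Carbon expression, N = ln(1 - P) / ln(1 - f), where f is the fraction of the genome carried by one clone and P is the desired probability that a given sequence is in the library. A minimal R sketch follows; the genome size and insert size are assumed values for illustration, and the function names are hypothetical.

```r
# Clarke-Carbon estimate: number of clones needed so that a given
# sequence is present in the library with probability P.
clones_needed = function(P, insert_size, genome_size) {
  f = insert_size / genome_size        # fraction of genome per clone
  ceiling(log(1 - P) / log1p(-f))      # log1p(-f) is log(1 - f)
}

# Assumed values: 3.2 Gb haploid genome, 150 kb inserts, P = 0.99.
n_clones = clones_needed(P = 0.99, insert_size = 150e3, genome_size = 3.2e9)

# Under a Poisson model, the chance that a base is covered by zero
# reads at mean coverage C is exp(-C).
p_uncovered = function(C) exp(-C)
p_uncovered(10)
```

To require both homologous copies to be represented, one rough simplification is to treat the diploid genome as the target, i.e. double `genome_size`.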
If the individual is heterozygous at a locus, you will see 2 alleles at that position. You would expect to see each allele in approximately 50% of the sequencing reads. However, is it correct that nothing stops your clone library from overrepresenting one chromosome, so that you do not get a 50:50 distribution of the two alleles?
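Even with an unbiased library, random sampling alone spreads the allele counts around 50:50. A small sketch of that spread, assuming each read draws either allele independently with probability 0.5 (a binomial model; cloning or amplification bias would come on top of this) and an assumed depth of 25:

```r
depth = 25  # assumed read depth at the heterozygous site

# Probability of each possible number of reads carrying allele B.
alt_counts = 0:depth
probs = dbinom(alt_counts, size = depth, prob = 0.5)

# Central 95% range of the allele-B read count, and as a fraction.
count_range = qbinom(c(0.025, 0.975), size = depth, prob = 0.5)
frac_range = count_range / depth
frac_range
```

Under these assumptions the allele-B count falls between 8 and 17 reads (32% to 68%) most of the time, so a 40:60 split at this depth is not by itself evidence of a biased library.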
Considering somatic mutations: it is possible that one of your blood cells has a spontaneous mutation at a particular locus, and that the DNA fragment from this blood cell is inserted into the clone library. Whilst I imagine this is very rare, is it possible? How would this show up in your sequencing results? Let's say a locus has 25x coverage and only one of those reads carries a different allele from the others because of the somatic mutation. Would it be classed as a sequencing error, or would you class the locus as heterozygous? If that locus was already heterozygous, you could in theory get 3 alleles there, I presume?
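One way to reason about the 1-in-25 case is to compare how likely that observation is under competing explanations. A sketch, assuming a binomial model and a per-base error rate of 1% (an assumed figure; real rates depend on the platform and base quality):

```r
depth = 25
alt_reads = 1
error_rate = 0.01  # assumed per-base error rate

# Chance of seeing at least one discordant read from errors alone.
p_error = 1 - dbinom(0, size = depth, prob = error_rate)

# Chance of seeing one or fewer alt reads if the site were truly
# heterozygous in the germline (expected allele fraction 0.5).
p_het = pbinom(alt_reads, size = depth, prob = 0.5)

c(error = p_error, het = p_het)
```

Under these assumptions a single discordant read is far better explained by error than by germline heterozygosity. A real somatic variant present in a small fraction of cells would look the same at this depth, which is why more supporting reads or deeper coverage are usually needed to tell the two apart.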
thanks a lot
I think things get even more complicated if you factor in ploidy (many cancers are not diploid).
I have been thinking of a formula that computes the cancer cell fraction (CCF) from measured variant allele frequencies (VAF), taking into account known ploidy and purity. Do you happen to know how to do this?
In Figure 1 of Landau et al. (http://www.nature.com/leu/journal/v28/n1/full/leu2013248a.html) they present an example with VAF=0.125, ploidy = 3, purity = 67%, and the resulting CCF=0.5. However, my own naive calculation 0.125 x 3 / 0.67 = 0.56 is slightly off. What am I missing?
Think I figured it out:
Also, in case anyone is wondering (it took me a minute to notice this), the figure shows the case where you literally have 3 cells: two are tumor and one is normal. So the purity of 0.67 in their example is a rounded version of 2/3. If you use more significant figures for the purity, you get closer to 0.5 using your equation.
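For reference, here is a sketch of one formula that reproduces the figure's numbers. It is an assumption, not necessarily the exact expression the poster arrived at: normal cells are taken to be diploid, every mutated tumor cell carries `multiplicity` mutated copies (1 here), and the denominator counts copies from both tumor and normal cells. The naive calculation 0.125 x 3 / 0.67 leaves out the normal-cell copies. The function name `ccf_from_vaf` is hypothetical.

```r
# CCF from VAF, assuming diploid normal cells and `multiplicity`
# mutated copies per mutated tumor cell.
ccf_from_vaf = function(vaf, purity, tumor_cn, multiplicity = 1, normal_cn = 2) {
  # Average number of copies of the locus per cell in the sample.
  total_copies = purity * tumor_cn + (1 - purity) * normal_cn
  vaf * total_copies / (purity * multiplicity)
}

ccf_exact   = ccf_from_vaf(vaf = 0.125, purity = 2/3,  tumor_cn = 3)  # 0.5
ccf_rounded = ccf_from_vaf(vaf = 0.125, purity = 0.67, tumor_cn = 3)  # about 0.498
```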
Nice, that gives you the MLE. The probability distribution given lowish coverage (used here http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575604/) can be modeled like this in R:
Very nice! Now I can even put 95% confidence intervals on my CCF estimates.
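A minimal sketch of such a model, assuming a binomial likelihood for the alt read count evaluated over a grid of candidate CCF values with a flat prior, one mutated copy per mutated cell, and diploid normal cells. The read counts, purity and copy number below are made-up illustration values, and this is not necessarily the exact code from the paper.

```r
# Assumed inputs for illustration.
alt_reads = 5; depth = 40; purity = 2/3; tumor_cn = 3

CCFs = seq(0.01, 1, by = 0.01)
# Expected VAF for each candidate CCF.
exp_vaf = purity * CCFs / (purity * tumor_cn + 2 * (1 - purity))
# Binomial likelihood of the observed reads, normalized over the grid.
post = dbinom(alt_reads, size = depth, prob = exp_vaf)
post = post / sum(post)

# Most likely CCF on the grid.
ccf_mode = CCFs[which.max(post)]
# 95% interval read off the cumulative distribution.
cum = cumsum(post)
ci = c(CCFs[which(cum >= 0.025)[1]], CCFs[which(cum >= 0.975)[1]])
```

With these inputs the mode is at CCF = 0.5, and `ci` holds the lower and upper bounds of the interval. The interval is wide at 40x, which is the point of modelling the full distribution at low coverage.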
One bit that may or may not concern you: CCF is not an accurate term for what this number means, which is an issue in a lot of these papers. What the number really measures is the average number of mutated copies per tumor cell. This usually doesn't matter, but consider the case where a mutation arises first and the region carrying it is then amplified a few times, so that every tumor cell carries several mutated copies. This "CCF" value will be greater than 1! That is clearly not a fraction...
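A small worked example of this effect, with assumed numbers: purity 0.8, 4 copies of the locus in every tumor cell, 3 of them mutated, and diploid normal cells.

```r
# Assumed scenario: every tumor cell carries the mutation on 3 of 4 copies.
purity = 0.8; tumor_cn = 4; mutated_copies = 3

# Average number of copies of the locus per cell in the sample.
total_copies = purity * tumor_cn + 2 * (1 - purity)
# VAF expected from this mixture.
vaf = purity * mutated_copies / total_copies

# Converting back while assuming one mutated copy per cell.
naive_ccf = vaf * total_copies / purity
naive_ccf
```

Here every tumor cell carries the mutation, so the true cell fraction is 1, yet the computed value is 3: it recovers the number of mutated copies per tumor cell.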
The code above, as used in the paper mentioned above, only looks at potential "CCF" values between 0 and 1. If you instead relax that restriction and change the line to
CCFs=seq(0.01,3,by=0.01)
you will see that the maximum can be over 1 in some of these cases.
The link to the paper by Landau et al. is outdated. Could you please provide the title, year, and first author of the paper?
How can one explain a het (0/1) somatic mutation in a tumor with a VAF higher than 50%? I saw some of these cases in the TCGA VCF files. Could it be that some portion of the sample carries the mutation in a homozygous state while the majority is heterozygous?