genotypes - somatic data
1
0
Entering edit mode
24 months ago
pingu77 ▴ 20

Hi all,

I am reading the github page of DeepVariant, and I don't understand the meaning of this sentence:

"For somatic data or any other samples where the genotypes go beyond two copies of DNA, DeepVariant will not work out of the box because the only genotypes supported are hom-alt, het, and hom-ref."

I think I am missing the biology of this statement. Could you please help me to understand it? What are the other options for the genotypes in case of somatic data?

Thank you and sorry for the stupid question.

genotype caller data somatic variant • 902 views

With cancer and somatic variants, you run into two complications: altered ploidy and heterogeneity of the sample. Altered ploidy refers to copy number alterations in the tumor DNA: fragments or even whole chromosomes can be duplicated or deleted, so the tumor DNA is no longer diploid (two copies of DNA). Heterogeneity refers to the fact that cancer samples usually contain different subclones, each with slightly different genetic variants. Therefore, when you sequence such a heterogeneous sample in bulk (WES, WGS, RNA-Seq) and run variant calling, you detect variants at different ratios, depending on how many tumor cells harbour a given variant and on how many copies of the locus carry it. These variants show up across a continuous range of variant allele frequencies (VAFs). For "normal" (germline) variants in a diploid genome, however, the VAF is approximately 0, 0.5 or 1.0, representing hom-ref, het, and hom-alt, respectively.
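To make the VAF argument concrete, here is a minimal sketch of the expected VAF in a bulk tumor sample. It is not how DeepVariant or any somatic caller works; it is just the usual back-of-the-envelope mixture model. The function name `expected_vaf` and its parameters (`purity`, `tumor_copies`, `mutated_copies`, `clonal_fraction`) are made up for illustration, and the model assumes the contaminating normal cells are diploid and do not carry the variant.

```python
def expected_vaf(purity, tumor_copies, mutated_copies,
                 clonal_fraction=1.0, normal_copies=2):
    """Expected variant allele frequency in a bulk sample (toy model).

    purity          -- fraction of cells in the sample that are tumor cells
    tumor_copies    -- total copies of the locus in each tumor cell
    mutated_copies  -- copies carrying the variant in a mutated tumor cell
    clonal_fraction -- fraction of tumor cells that carry the variant
    normal_copies   -- copies of the locus in normal cells (assumed 2)
    """
    if not 0.0 <= purity <= 1.0 or not 0.0 <= clonal_fraction <= 1.0:
        raise ValueError("purity and clonal_fraction must be in [0, 1]")
    if mutated_copies > tumor_copies:
        raise ValueError("mutated_copies cannot exceed tumor_copies")
    # Mutant alleles contributed per average cell in the sample.
    mutant = purity * clonal_fraction * mutated_copies
    # All alleles at this locus per average cell (tumor plus normal).
    total = purity * tumor_copies + (1.0 - purity) * normal_copies
    if total == 0:
        raise ValueError("locus is absent from the sample")
    return mutant / total


if __name__ == "__main__":
    # Pure diploid sample, heterozygous variant: the germline-like case.
    print(expected_vaf(1.0, 2, 1))
    # 60% tumor purity, diploid tumor, clonal heterozygous variant.
    print(expected_vaf(0.6, 2, 1))
    # Same, but the locus was amplified to 3 copies, 1 of them mutated.
    print(expected_vaf(0.6, 3, 1))
    # Same purity, diploid, but only 25% of tumor cells carry the variant.
    print(expected_vaf(0.6, 2, 1, clonal_fraction=0.25))
```

Only the first call lands on one of the three values a diploid germline caller expects; the others fall somewhere in between, which is the point of the comment above.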

24 months ago

I obviously lack some context here, but just based on this one sentence, I would interpret it as follows:

The tool only supports (is trained on) diploid genotypes: homozygous reference (e.g. A/A), heterozygous (e.g. A/G) or homozygous alternate (e.g. G/G), which correspond to expected alternate allele frequencies of 0, 0.5 and 1.

In somatic data, mutations accumulate over time and differ from cell to cell, so unless you have single-cell sequencing, you may find more than two alleles at a given site, at arbitrary proportions, e.g. 60% C, 35% T and 5% A or whatever.
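As a small illustration of that example, the sketch below tallies the bases observed at one position and reports their fractions. The helper name `allele_fractions` and the toy pileup string are invented for this post; real pileups would come from a tool such as samtools or pysam and would also need base-quality and mapping-quality filtering, which is skipped here.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}


def allele_fractions(bases):
    """Return {base: fraction of reads} for the bases seen at one site.

    `bases` is any iterable of single-character base calls; anything
    that is not A, C, G or T (e.g. N) is ignored.
    """
    counts = Counter(b.upper() for b in bases if b.upper() in VALID_BASES)
    depth = sum(counts.values())
    if depth == 0:
        return {}
    # most_common() orders alleles from most to least frequent.
    return {base: n / depth for base, n in counts.most_common()}


if __name__ == "__main__":
    # Toy pileup of 100 reads: 60 C, 35 T and 5 A.
    pileup = "C" * 60 + "T" * 35 + "A" * 5
    for base, fraction in allele_fractions(pileup).items():
        print(base, fraction)
```

None of the three fractions is close to 0, 0.5 or 1, and there are three alleles rather than two, so no single diploid genotype describes this site.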

The "any other samples where the genotypes go beyond two copies of DNA" refers to any cell that is not, or is no longer, diploid. In humans this may happen through Copy Number Variations (CNVs), e.g. in cancer. In other organisms, a different ploidy is quite common: the marbled crayfish, for example, is triploid (has three copies of its genome in each cell), and plant genomes are notorious for having high ploidies (some cultivars of berries up to >25, I believe). Some flies such as Drosophila are capable of selectively amplifying chromosomes (polytene chromosomes) and can generate a few thousand copies within a single cell.
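To see why ploidy matters for a genotype caller, here is a short sketch that lists the alternate allele fractions you would expect for each possible genotype at a biallelic site, given the ploidy. The function name `expected_allele_fractions` is made up for illustration, and the sketch assumes every copy of the genome is sequenced equally well.

```python
from fractions import Fraction


def expected_allele_fractions(ploidy):
    """Expected alt-allele fractions for a biallelic site at a given ploidy.

    With `ploidy` copies of the locus, 0..ploidy of them can carry the
    alternate allele, giving ploidy + 1 possible genotypes.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    return [Fraction(k, ploidy) for k in range(ploidy + 1)]


if __name__ == "__main__":
    for ploidy in (2, 3, 4):
        fractions = expected_allele_fractions(ploidy)
        print(ploidy, [str(f) for f in fractions])
```

A diploid genome gives exactly the three classes hom-ref, het and hom-alt (0, 1/2, 1). A triploid genome already has four classes (0, 1/3, 2/3, 1), so a model with only three output genotypes cannot represent it.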

All of those cases are obviously very challenging for variant calling, because it is unclear whether an infrequent base call represents a sequencing error or a genuine biological variant.
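One simple way to reason about the "error or real variant" question is to ask how likely it is to see that many alternate reads from sequencing errors alone. The sketch below does this with a plain binomial tail probability. It is a deliberately naive model, not what any particular caller implements: the function name `prob_at_least_k_errors` is invented, and the per-base error rate of 1% toward the specific alternate base is an assumed number, not a measured one.

```python
from math import comb


def prob_at_least_k_errors(depth, alt_reads, error_rate):
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    Interpreted as: if every read truly carries the reference base, how
    often would errors alone produce at least `alt_reads` alt calls?
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must be between 0 and depth")
    # Sum the upper tail of the binomial distribution directly.
    return sum(
        comb(depth, i) * error_rate**i * (1.0 - error_rate) ** (depth - i)
        for i in range(alt_reads, depth + 1)
    )


if __name__ == "__main__":
    depth = 100
    error_rate = 0.01  # assumed value, for illustration only
    for alt_reads in (1, 2, 5, 10):
        p = prob_at_least_k_errors(depth, alt_reads, error_rate)
        print(alt_reads, p)
```

Under these assumptions, one alternate read out of 100 is entirely unremarkable, while five is already unlikely to be error alone. At low VAF and modest depth the two explanations are hard to separate, which is why dedicated somatic callers use matched normals and more elaborate error models.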
