7.8 years ago
abedkurdi10
Hello all, I need to know if there is a way to estimate the minimum/optimal number of samples needed to detect a variant from panel sequencing (exome sequencing of a set of genes).
I appreciate your help, thank you.
Your question is unclear. The number of samples required to find a variant is 1, if that sample has a variant in your region of interest. If you want to know the minimal number of samples to detect at least one variant, that would depend on how large the target region is that you are sequencing and how polymorphic the region is.
I'm not sure what you really want to ask, but the question as you posted it here is at least ambiguous.
Exactly. And how polymorphic the population in question is. And whether you mean a novel variant, or any variant. If the region, as you said, harbors polymorphisms, then the probability you'll pick up a variant in a single sample, at minimum, is the minor allele frequency of the polymorphism in question in the population your individual comes from.
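To put rough numbers on that, here is a minimal sketch of the chance of seeing a known polymorphism in one sample, and of the number of samples needed to see it at least once. It assumes diploid, unrelated individuals, Hardy-Weinberg proportions, and perfect variant calling; the function names are made up for illustration.

```python
import math


def prob_carrier(maf: float) -> float:
    """Chance that one diploid individual carries at least one copy of the
    minor allele (assumes Hardy-Weinberg proportions)."""
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must be strictly between 0 and 1")
    return 1.0 - (1.0 - maf) ** 2


def prob_at_least_one_carrier(maf: float, n_samples: int) -> float:
    """Chance that at least one of n unrelated individuals is a carrier."""
    # Each individual contributes two independent allele draws (assumption).
    return 1.0 - (1.0 - maf) ** (2 * n_samples)


def min_samples(maf: float, target: float = 0.95) -> int:
    """Smallest number of individuals giving at least `target` probability
    of observing the minor allele at least once."""
    if not 0.0 < maf < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("maf and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / (2.0 * math.log(1.0 - maf)))


if __name__ == "__main__":
    for maf in (0.05, 0.01, 0.001):
        print(maf, prob_carrier(maf), min_samples(maf, target=0.95))
```

Note that this only answers "will I see the variant at all"; it says nothing about associating it with a phenotype or exposure.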
I suspect the question is more along the lines of a power calculation, where you are looking for the probability of finding the causal mutation for a Mendelian or complex genetic disorder in a cohort/family of patients, given that you are doing exome sequencing. Maybe not that question exactly, but along those lines. In which case the answer isn't straightforward anyway. And the answer is: sequence as many samples as you can reasonably afford.
Indeed, it is about power calculation. I need to identify (known and de novo) mutations in low-penetrance genes in cancer patients using panel sequencing of 30 to 50 genes maximum. Assuming we can achieve an average coverage of 100x, we were wondering if there is a formula/rule that we can use to estimate the minimum number of patients and controls needed to reliably associate mutations with an environmental factor (e.g., exposure to a certain carcinogen, …)
Many thanks and sorry for the ambiguity.
Since you are investigating cancer genetics, are you looking for somatic or germline variants? Because that too will play a role here.
With 100x you should be able to detect (almost) every germline variant present in your targeted region. So it's not a question of "can I detect the variant". In 30 to 50 genes you would find variants, definitely. You can get an idea of variants and their frequencies from ExAC and/or gnomAD.
But since you want to do an association study, this all depends on the effect size of the variants you think you are going to find.
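To illustrate how strongly the required cohort size depends on effect size, here is a rough sketch using the textbook normal-approximation sample-size formula for comparing a carrier frequency between two equally sized groups. Everything here is an assumption for illustration: the function name, the single-variant two-sided test, the example carrier frequency and odds ratios. A real study design (multiple testing across variants, burden tests, covariates) needs a proper power analysis.

```python
import math
from statistics import NormalDist


def samples_per_group(p_ref: float, odds_ratio: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of individuals needed in EACH of two equal groups
    to detect a difference in carrier frequency.

    p_ref      -- carrier frequency in the reference group (e.g. from gnomAD)
    odds_ratio -- assumed effect size in the comparison group
    alpha      -- two-sided significance level (lower it for multiple testing)
    """
    if not 0.0 < p_ref < 1.0 or odds_ratio <= 0 or odds_ratio == 1.0:
        raise ValueError("need 0 < p_ref < 1 and odds_ratio > 0, != 1")
    # Convert the odds ratio into a carrier frequency for the second group.
    odds = odds_ratio * p_ref / (1.0 - p_ref)
    p_alt = odds / (1.0 + odds)

    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_ref + p_alt) / 2.0

    numerator = (z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_power * math.sqrt(p_ref * (1.0 - p_ref)
                                       + p_alt * (1.0 - p_alt))) ** 2
    return math.ceil(numerator / (p_alt - p_ref) ** 2)


if __name__ == "__main__":
    # Hypothetical 5% carrier frequency, a range of assumed odds ratios.
    for oddsr in (1.2, 1.5, 2.0, 5.0):
        print(oddsr, samples_per_group(0.05, oddsr))
```

Small odds ratios, which is what you would expect for low-penetrance genes, push the required numbers up very quickly.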
Thanks for the fast reply. Actually, we are looking for germline variants. So, basically, it boils down to the effect size in this case. For this, our biostatistician is going to help. However, I am curious to know (and maybe other readers of this post are too) what would be the case if we were looking for a somatic variant with a relatively low frequency!
Your chance of detecting somatic variants with low frequency depends (apart from an accurate tissue biopsy) on your coverage. So given a frequency of 20% you would expect that roughly 20 of the 100 reads at 100x coverage carry that variant. Note that it will often not be exactly 20 reads, since you are sampling molecules from a mixture (binomial sampling, which is close to a Poisson distribution at low frequencies). If the frequency is much lower, say 1%, you'll see that you need a higher coverage to get a sufficient chance of finding the mutant allele.
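As a back-of-the-envelope illustration of the point above, the sketch below computes the chance of seeing enough variant reads at a given depth and allele frequency. It assumes simple binomial sampling of reads (a Poisson model gives similar numbers at low frequencies), ignores sequencing errors and uneven coverage, and uses a made-up threshold of 5 supporting reads; real variant callers apply their own criteria.

```python
from math import comb


def prob_detect(depth: int, vaf: float, min_alt_reads: int = 5) -> float:
    """Chance of observing at least `min_alt_reads` variant-supporting reads
    at a site covered by `depth` reads, when the true variant allele
    frequency in the sample is `vaf`.

    Assumes each read independently carries the variant with probability vaf.
    """
    if depth < 0 or not 0.0 <= vaf <= 1.0:
        raise ValueError("need depth >= 0 and 0 <= vaf <= 1")
    # Probability of seeing fewer than the required number of variant reads.
    upper = min(min_alt_reads, depth + 1)
    p_below = sum(comb(depth, k) * vaf ** k * (1.0 - vaf) ** (depth - k)
                  for k in range(upper))
    return 1.0 - p_below


if __name__ == "__main__":
    for depth in (100, 500, 1000):
        for vaf in (0.20, 0.05, 0.01):
            print(depth, vaf, round(prob_detect(depth, vaf), 4))
```

Under these assumptions a 20% variant is essentially always seen at 100x, while a 1% variant is almost never supported by 5 reads at 100x and only becomes reliably visible at around 1000x.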
That said, I'm not working in cancer genetics, so there will be people/papers/... with a more accurate answer to that.
This is correct. On the clinical side with panel sequencing, we shoot for 500x-1000x coverage in order to identify low-frequency variants (<10%) in our samples. For an idea of the number of samples needed to identify recurrent mutations in these cohorts, look to the TCGA. You're usually talking about sequencing cohorts of hundreds of patient samples. For very rare cancers and more targeted sequencing you may do fewer (<100), but you have very little power. It's just a matter of something being better than nothing in terms of data, even if you don't have statistical power.
I'll add that since you have a biostatistician, they are presumably used to doing power calculations. A power calculation for NGS or exome sequencing is no different from one for any other method.
Are you looking for a specific variant?
No, I am not looking for a specific variant.