I came across a lot of literature that treats 30x as a good coverage depth for NGS. But then I read Broad's NGS intro, where they used "high-pass: 30x; low-pass: 4x; exon capture: 150x". I am not sure when the intro note was written, so I have no idea how relevant those numbers are to the HiSeq 2000.
Q1. How do they come up with those numbers? (Anything to do with their Bayesian model-based SNP calling?)
Q2. How are those numbers related to detecting rare SNPs in a small pool of samples? Say I want to find a SNP whose population-based occurrence is between 0.01% and 0.1%, and I only have a handful of individual samples (say 20). Should I adjust the coverage depth (say, from 30x to 100x) in order to detect the SNP of interest with high confidence?
Some digging around in Google and PubMed did not really help me find the answers. Any help or reference will be greatly appreciated. Thank you.
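A rough back-of-the-envelope sketch for Q2, separating sample size from depth. It assumes the "occurrence" is an allele frequency, that the 20 individuals are unrelated diploid samples drawn at random from the population, and that chromosomes are independent; none of this comes from the Broad note, and the function name is only illustrative.

```python
def prob_variant_in_sample(allele_freq: float, n_individuals: int) -> float:
    """Chance that at least one of n diploid individuals carries the allele.

    Assumes random, unrelated individuals and independent chromosomes.
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# Allele frequencies of 0.1% and 0.01%, with 20 sequenced individuals
for freq in (0.001, 0.0001):
    p = prob_variant_in_sample(freq, 20)
    print(f"allele frequency {freq:.2%}: P(present in 20 samples) = {p:.4f}")
```

Under these assumptions the variant is present in the 20 samples only about 4% of the time at 0.1% frequency and about 0.4% of the time at 0.01%, so in a random population sample the limiting factor would be the number of individuals rather than the depth per individual.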
Not sure if my clarification is needed, as the "consensus" seems to be, as Obi pointed out, less about statistics and more about cost effectiveness. But here is my real situation:
It is not about tumor samples, but about an uncommon (but not rare) hereditary disease. We are to sequence only patients and their immediate family members. In this case, my understanding is that "sample size" may not be as important, as we are directly targeting the patient population, who presumably carry the mutations of interest. Is it correct to assume that in this case coverage depth is not as big an issue as in a population-based study?
If we are to do selected exon/target sequencing, as opposed to WGS, is it correct that we can lower the coverage depth but still identify uncommon/rare SNPs? My understanding is that WGS needs a higher coverage to detect rare SNPs (in coding regions) because sequencing technology has an intrinsic bias toward more coverage of non-coding regions. Is this correct?
Not sure if I should re-post this as a new question or just leave it here as a comment. Regardless, anyone who has any thoughts to share is welcome to post them here. Thanks all.
This clarification is great, and it changes things substantially.
Since you are sequencing affected individuals and their families, you will hopefully have a good chance of detecting your variants of interest. At least you are less likely to miss them because they just aren't in your population. So you are correct that "sample size" is not as big a problem. But I'm not sure that coverage/depth is less or more of an issue in this kind of study vs a population-based study. Either way, if you are sequencing individual patients (from one family or a big population), the question is whether you will be able to confidently detect the important variants in any given individual. The number of individuals you sequence isn't a factor in that. For such a study I'm guessing you will be obtaining fresh (or appropriately stored) blood samples and will be able to get good quality DNA. This is probably more important than anything, and it means that you may be able to get away with less total depth of coverage without having to worry about the tumor purity/heterogeneity issues that we have.
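To make "confidently detect a variant in a given individual" a bit more concrete, here is a minimal sketch under simplifying assumptions that are mine, not any caller's actual model: the site is heterozygous, each read shows the variant allele with probability 0.5, the depth at the site is exactly the stated value, and sequencing errors and mapping bias are ignored. The threshold `min_alt_reads` is a hypothetical rule of thumb.

```python
from math import comb


def prob_detect_het(depth: int, min_alt_reads: int) -> float:
    """P(at least min_alt_reads variant reads) at a heterozygous site.

    Models the variant read count as Binomial(depth, 0.5).
    """
    # Probability of seeing fewer variant reads than the threshold
    p_miss = sum(comb(depth, k) for k in range(min_alt_reads)) / 2 ** depth
    return 1.0 - p_miss


# Compare low-pass, standard and deep coverage with a 3-read threshold
for depth in (4, 30, 100):
    print(f"{depth}x: P(>= 3 variant reads) = {prob_detect_het(depth, 3):.7f}")
```

Under this toy model a germline heterozygous variant is missed often at 4x but almost never at 30x, which is one way to see why depth well beyond 30x matters mainly when the variant allele fraction is far below 50%, as in impure or heterogeneous tumor samples.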
If you do exon-targeted sequencing you will only get sequence (more or less) for exon regions. That is fine if you want to assume that your important disease-related mutations are only going to be in coding regions. WGS would also give you sequence data for non-coding regions, which could be important (see the recent papers in Science on the likely importance of TERT promoter mutations in melanoma). Still, many people make this simplifying assumption for cost reasons. You can cover the exome at a depth sufficient for confident variant calls with far fewer total reads than it would take to get the same level of coverage and depth for the whole genome.
WGS by definition covers more of the genome (since it is not limited to the exome), so you would need more total reads to get the same average depth of coverage. But I would not say that it has an intrinsic bias toward more coverage of non-coding regions. That would seem to imply that it is easier to get reads for non-coding regions than for coding regions, and I am not aware that this is an issue with WGS. Anyway, if you go the exome route, I would still target the same average coverage. For example, you might target 30-50X for at least 80% of the exome (instead of 30-50X for at least 80% of the whole genome). But you will be able to do this with far fewer reads (0.5-1 lanes of HiSeq 2000 instead of 3-5 lanes).
Part of your confusion might stem from the unfortunate multiple and confusing uses of terms such as coverage and depth in NGS. Reading "What is the sequencing 'depth'?" might help.
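The difference in read counts follows from the usual average-depth relation, depth = reads × read length / target size. The sketch below only illustrates that arithmetic; the genome size, exome size, read length and on-target fraction are round-number assumptions for illustration, not figures from this thread or from any instrument specification.

```python
from math import ceil


def reads_needed(target_bp: int, depth: float, read_len: int,
                 on_target_fraction: float = 1.0) -> int:
    """Reads required for a given average depth over a target region.

    on_target_fraction accounts for capture reads that miss the target.
    """
    return ceil(target_bp * depth / (read_len * on_target_fraction))


GENOME_BP = 3_000_000_000   # assumed round figure for the human genome
EXOME_BP = 50_000_000       # assumed capture target size
READ_LEN = 100              # assumed read length in bases

wgs = reads_needed(GENOME_BP, 30, READ_LEN)
exome = reads_needed(EXOME_BP, 30, READ_LEN, on_target_fraction=0.5)
print(f"WGS at 30x:   {wgs:,} reads")
print(f"Exome at 30x: {exome:,} reads")
print(f"Ratio:        {wgs / exome:.0f}x fewer reads for the exome")
```

Even with a pessimistic on-target fraction, the exome needs a small fraction of the reads for the same average depth, which is the cost argument above. Average depth says nothing about how evenly that depth is spread, which is why targets are usually phrased as "at least N x over at least some percentage of the region".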