Dear colleagues,
I would be interested to know if there is a publication or other reference for NGS coverage statistics.
The NGS guidelines published in 2016 (PMID: 26508566) provide an estimate of the probability of missing a variant in different situations, based on the % of target bases covered >20x, e.g. 75% of target bases >20x, 86% >20x, and 96% >20x.
I was wondering if there are other relevant articles to read, and what the expected percentage of bases >20x would be for a 65x and a 130x (mean coverage) exome.
Thank you in advance!
Peter
You might be able to get a useful value by also incorporating the standard deviation, some other measure of variance, and/or additional variables, but not with a single universal closed-form equation using the mean coverage alone. Different library preparation, amplification, and sequencing platforms greatly affect the shape of the coverage distribution. With Illumina single-cell sequencing at 65x coverage, you might get 80% of the genome with zero coverage, while with unamplified PacBio isolate data you may get no bases under 20x. Exome capture will fall between those extremes and depends on the specific kit, which will have variable region-specific capture efficiency (and incur additional reference bias anyway). I encourage you to test it yourself, or look at existing results for the exome kit you want to use: subsample it to 65x or 130x, map it, and look at the coverage distribution over the baits.
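If it helps, here is a minimal sketch of that last step, assuming you have already subsampled and mapped the reads and produced a per-base depth table over the baits (e.g. the three-column chrom/pos/depth output of `samtools depth -a -b baits.bed sample.bam`); the file name and the 20x threshold are placeholders:

```python
#!/usr/bin/env python3
"""Summarise per-base depth over exome baits: fraction of target bases >= 20x.

Expects a tab-separated file with columns chrom, pos, depth, such as the output of
`samtools depth -a -b baits.bed sample.bam`. File name and threshold are placeholders.
"""
import sys
from collections import Counter

THRESHOLD = 20  # minimum depth of interest

def summarise(depth_path, threshold=THRESHOLD):
    total = 0          # number of target positions seen
    at_or_above = 0    # positions with depth >= threshold
    hist = Counter()   # depth histogram, capped at 100x for readability
    with open(depth_path) as handle:
        for line in handle:
            _chrom, _pos, depth = line.rstrip("\n").split("\t")
            depth = int(depth)
            total += 1
            if depth >= threshold:
                at_or_above += 1
            hist[min(depth, 100)] += 1
    return total, at_or_above, hist

if __name__ == "__main__":
    total, at_or_above, hist = summarise(sys.argv[1])
    print(f"{at_or_above / total:.1%} of target bases >= {THRESHOLD}x "
          f"({at_or_above:,} / {total:,})")
```

Repeating this on the same library subsampled to 65x and to 130x will give you the kit-specific answer directly, rather than relying on a formula.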
Would anybody have an idea?
It's just that different options, e.g. 65x and 130x, are often offered by the same lab at different prices (the latter being 50% more expensive). For the 130x mean coverage they quote ~96-97% of target bases above 20x. So I was wondering what the % of bases covered >20x would be for the 65x exome, and whether this can also be applied to the options offered by other labs (75x, 200-250x, etc.).
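As a purely idealised illustration of why the mean alone cannot answer this (not a prediction for any real kit): if per-base depth followed a Poisson distribution with the stated mean, essentially 100% of bases would exceed 20x at both 65x and 130x, yet the lab quotes only ~96-97% at 130x. The gap is the overdispersion from capture and library preparation described above, and it differs between kits, which is why the 65x figure cannot be derived from the 130x one. A quick sketch of the idealised calculation (SciPy assumed to be installed):

```python
# Idealised illustration only: real exome coverage is far more dispersed than Poisson,
# so these figures are optimistic upper bounds, not predictions for any particular kit.
from scipy.stats import poisson

for mean_cov in (65, 130):
    frac_ge_20 = poisson.sf(19, mean_cov)  # P(depth >= 20) under Poisson(mean = mean_cov)
    print(f"mean {mean_cov}x: {frac_ge_20:.4%} of bases >= 20x under a Poisson model")
```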
Hi Peter, unless you're analysing cancer data, you don't need 130x. Go for the 65x.
If you process the data correctly, you can ensure almost complete pick-up of germline variants at coverage as low as 18x, provided your panel is also well designed, with primers that map uniquely to the genome.
My own validation data from a carefully constructed analytical pipeline that's currently being used in a clinical genetics laboratory has the following sensitivity to Sanger-confirmed variants:
That's interesting. Can you describe how you came to that result? Is this the mean coverage or the minimum coverage at every target position? And, as I always ask when I read about coverage: how do you calculate the coverage?
fin swimmer
This is read depth at each variant position in the genome. It's just reporting sensitivity (not specificity) against Sanger sequencing over 13 samples selected as validation samples. The data are from around 350 Sanger-confirmed variants in these samples. Also, the panel is reasonably small (~30-40 genes, I believe). At 18x, for example, specificity (i.e. the false-positive rate) may be quite poor.
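For concreteness, the sensitivity figure is just the fraction of Sanger-confirmed variants that the NGS pipeline also called; a toy sketch (the variant keys and numbers below are hypothetical, not the validation data itself):

```python
# Toy illustration of the sensitivity calculation described above (not the actual pipeline).

def sensitivity(sanger_confirmed, ngs_called):
    """Fraction of Sanger-confirmed variants that the NGS pipeline also called.

    Both arguments are sets of variant keys, e.g. (chrom, pos, ref, alt).
    """
    return len(sanger_confirmed & ngs_called) / len(sanger_confirmed)

# Hypothetical example keys; the real validation set was ~350 variants over 13 samples.
sanger = {("chr17", 41245466, "G", "A"), ("chr13", 32907420, "T", "C")}
ngs = {("chr17", 41245466, "G", "A"), ("chr13", 32907420, "T", "C")}
print(f"sensitivity = {sensitivity(sanger, ngs):.1%}")  # 100.0% for this toy example
```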
Thank you Kevin. I am afraid it is not that simple.
There was a recent talk at the ESHG meeting by A. Rauch citing the NGS guidelines I mentioned above. The following stats are from the guidelines.
[image: coverage statistics table from the NGS guidelines]
So the conclusion of the discussion was that, in the clinical setting, it is better to "throw" your money at higher coverage than to ask for a low-coverage exome.
In addition to this, mosaicism has been described for ~6.5% of DNMs (de novo mutations) and could account for 4-8% of patients with autism spectrum disorder.
Practically, for the 130x exome the lab I am collaborating with quotes >95% of target bases covered at >20x. So I was wondering what the % of target bases covered >20x would be for the 65x exome.
And, further to this, whether that figure would apply to 65x exomes from other labs, or how to compare it with the coverage proposed by others (75x, 200x, etc.).
Hey, yes, I wish that it was simple (but as you've implied, it's not!).
Just to be sure: when you use 'missing' in your post here, are you referring to bases that are simply not sequenced? As it's an exome panel, there will always be regions that suffer low coverage due to sequence similarity with other regions in the genome. In these cases, aiming for a higher depth of coverage can indeed help, but your coverage profile will look like a roller-coaster track.
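One way to see that roller-coaster profile concretely is to summarise depth per target interval rather than genome-wide, and flag intervals where few bases reach 20x (these are often the regions with paralogous sequence). A minimal sketch, assuming a BED file of targets and the same chrom/pos/depth table as earlier in the thread; file names and thresholds are placeholders:

```python
# Flag target intervals with a poor fraction of bases >= 20x.
# Assumes `targets.bed` (BED, 0-based half-open) and `sample.depth`, a tab-separated
# chrom/pos/depth table restricted to those targets (e.g. from `samtools depth -a -b`).
import bisect
from collections import defaultdict

THRESHOLD = 20       # minimum depth of interest
MIN_FRACTION = 0.90  # flag intervals where fewer than 90% of bases reach the threshold

def load_targets(bed_path):
    """Return {chrom: sorted list of (start, end)} intervals."""
    per_chrom = defaultdict(list)
    with open(bed_path) as bed:
        for line in bed:
            chrom, start, end, *_ = line.rstrip("\n").split("\t")
            per_chrom[chrom].append((int(start), int(end)))
    return {chrom: sorted(ivs) for chrom, ivs in per_chrom.items()}

def per_interval_counts(depth_path, targets, threshold=THRESHOLD):
    """Return {(chrom, start, end): [total_bases, bases_at_or_above_threshold]}."""
    counts = defaultdict(lambda: [0, 0])
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in targets.items()}
    with open(depth_path) as handle:
        for line in handle:
            chrom, pos, depth = line.rstrip("\n").split("\t")
            ivs = targets.get(chrom)
            if not ivs:
                continue
            pos0 = int(pos) - 1  # samtools depth positions are 1-based
            i = bisect.bisect_right(starts[chrom], pos0) - 1
            if i < 0 or pos0 >= ivs[i][1]:
                continue         # position not inside any target interval
            key = (chrom, *ivs[i])
            counts[key][0] += 1
            if int(depth) >= threshold:
                counts[key][1] += 1
    return counts

if __name__ == "__main__":
    targets = load_targets("targets.bed")
    counts = per_interval_counts("sample.depth", targets)
    for (chrom, start, end), (total, ok) in sorted(counts.items()):
        if total and ok / total < MIN_FRACTION:
            print(f"{chrom}:{start}-{end}\t{ok / total:.1%} of bases >= {THRESHOLD}x")
```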
From my data, I was implying that, once you do actually have coverage and you have carefully processed the data, there are clever ways of ensuring complete pick-up of variants. Even for the 18x / 99.5% sensitivity figure that I mentioned above, I have since learned analytical approaches that can 'find' the 0.5% of variants that were not detected in this sample.