Question

Exome Sequencing Depth/Target Considerations And Shared Controls

4


12.6 years ago

Ryan D ★ 3.4k

We are proposing to use Agilent SureSelect for whole exome sequencing of four cases with mutations identified in a single gene having 32 exons. Supposing that SureSelect covers 30 of the 32 exons, and that reads are 75% on target with 90% alignment, all four samples could be done at 100x coverage in a single HiSeq lane. The reason for doing so is to show that we can use an NGS method to identify variants normally covered by Sanger sequencing kits in CLIA labs at 10x the cost. If we can, we plan to extend this to an 80-gene panel.
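For readers who want to check the arithmetic behind an estimate like this, here is a minimal sketch. The 75% on-target and 90% alignment figures and the four samples come from the post; the lane yield (300 million 100 bp reads) and the 50 Mb capture target are assumptions chosen for illustration, not figures from the post or from Agilent or Illumina documentation. It also assumes the on-target fraction applies to aligned reads.

```python
def mean_coverage(reads_per_lane, read_length, aligned_frac, on_target_frac,
                  target_bp, n_samples):
    """Expected mean per-sample depth over the capture target.

    Assumes reads are split evenly across samples and that the
    on-target fraction is measured among aligned reads.
    """
    usable_bases = reads_per_lane * read_length * aligned_frac * on_target_frac
    return usable_bases / (target_bp * n_samples)


# Hypothetical lane yield and target size (assumptions, see text above).
depth = mean_coverage(
    reads_per_lane=300_000_000,  # assumed: 150M read pairs, counted as single reads
    read_length=100,             # assumed read length in bp
    aligned_frac=0.90,           # from the post
    on_target_frac=0.75,         # from the post
    target_bp=50_000_000,        # assumed capture target size
    n_samples=4,                 # from the post
)
print(f"{depth:.0f}x mean coverage per sample")
```

Under these assumed inputs the result comes out at roughly 100x, but mean depth says nothing about how evenly individual exons are covered, so it is only a first sanity check.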

In your considered opinions, is coverage with these parameters sufficient? Further, if we hope to show we can identify 2 or 3 of the 4 mutations in this gene using NGS, are there appropriate shared controls available that we could use with this sequencing data? Or do you think batch differences between institutions, machines, and DNA sample prep would instead warrant using only controls sequenced with the same Agilent kit and similar coverage on the same machines at our institution? Better to have a solid experimental design before proceeding than to have grant reviewers shred us for comparing apples to oranges.

Please share your experiences or any relevant publications, and feel free to edit the tags above. Thanks.

next-gen exome • 4.7k views

updated 12.6 years ago by Ron128 ▴ 30 • written 12.6 years ago by Ryan D ★ 3.4k

score 4 · Answer 1 · 2012-04-08


12.6 years ago

Alex Paciorkowski 3.5k

Your coverage estimate sounds about right, and this experimental design is one that I bet is ongoing in several labs right now as NGS moves to replace Sanger methods in the clinical arena. How to validate whole exome sequencing and compare it head-to-head with Sanger is a hot topic. One issue -- as you've alluded to with your 2 uncovered exons -- is how to fill the gaps not covered by exome sequencing. Designing Sanger "band-aids" to cover these areas does feel like a bit of a nuisance.
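To decide which exons would need a Sanger "band-aid", one option is to check per-base depth against the exon coordinates. The following is a rough sketch only: the function name, the exon names, and the 20x threshold are all hypothetical, and the per-base depths are assumed to have been produced already by some depth tool (for example `samtools depth`) and loaded into a dictionary.

```python
def exons_needing_sanger(exons, depth, min_depth=20, min_fraction=1.0):
    """Return names of exons where too few bases reach min_depth.

    exons: dict of name -> (start, end), 1-based inclusive coordinates
    depth: dict of position -> read depth; missing positions count as 0
    min_depth, min_fraction: illustrative thresholds, not a standard
    """
    flagged = []
    for name, (start, end) in exons.items():
        length = end - start + 1
        covered = sum(
            1 for pos in range(start, end + 1)
            if depth.get(pos, 0) >= min_depth
        )
        if covered / length < min_fraction:
            flagged.append(name)
    return flagged


# Toy data: exon1 is fully covered, exon2 has low or missing depth.
exons = {"exon1": (100, 104), "exon2": (200, 204)}
depth = {pos: 100 for pos in range(100, 105)}
depth.update({200: 100, 201: 5})

gaps = exons_needing_sanger(exons, depth)
print(gaps)
```

Exons that the capture kit does not target at all will show up here with zero depth, alongside exons that are targeted but poorly captured, so both kinds of gap end up on the same follow-up list.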

As for controls, you will need mutation-negative controls to prove that variants introduced by NGS are recognized by your informatics pipeline and not carried through to final results. You will also need a blinded cohort of mutation-positives and mutation-negatives (unknowns) to prove you can identify them correctly. All of the NGS results will need to be redone by Sanger methods to show validity of NGS compared to "gold-standard" current methods.
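As an illustration of how the blinded comparison could be scored once Sanger results are in, here is a small sketch that treats Sanger as the reference and tallies per-sample agreement. The sample IDs and results are made up for the example, and a real validation would likely score individual variants rather than a single yes/no per sample.

```python
def validation_metrics(sanger, ngs):
    """Compare NGS calls against Sanger results, sample by sample.

    sanger, ngs: dict of sample_id -> bool (True = mutation reported).
    Sanger is treated as the reference ("gold standard").
    """
    tp = fp = tn = fn = 0
    for sample, truth in sanger.items():
        called = ngs[sample]
        if truth and called:
            tp += 1
        elif truth and not called:
            fn += 1
        elif not truth and called:
            fp += 1
        else:
            tn += 1
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        # None when the denominator is zero (no positives or no negatives)
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Made-up blinded cohort: four mutation-positive cases, two negative controls.
sanger = {"s1": True, "s2": True, "s3": True, "s4": True, "c1": False, "c2": False}
ngs = {"s1": True, "s2": True, "s3": True, "s4": False, "c1": False, "c2": False}

metrics = validation_metrics(sanger, ngs)
print(metrics)
```

With only a handful of samples these proportions carry wide uncertainty, which is one more argument for the larger blinded cohort described above.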

I encourage you to consult with friends/colleagues in pure clinical labs to design the best validation techniques, following CLIA guidelines as much as possible.

I would argue all of the NGS work should be done at one institution, as there are likely to be artifacts introduced that are lab- and machine-specific.

A version-tracking workflow tool for your informatics such as Galaxy is a must, and your reviewers will thank you.

A recent review that covers some of these issues is here.

written 12.6 years ago by Alex Paciorkowski 3.5k

1


I had not heard that term: Sanger Band-aids. If that is not already coined, it should be.

As I understand it, investigators using this same pipeline will often see a dozen individuals apparently homozygous for a SNP never before seen which then turns out to be a sequencing artifact.

Thanks also, Alex, for pointing me to Galaxy to track this. I use it for a number of other UCSC issues. It would be helpful to get some background using it for NGS workflows. It looks like one is posted here: https://test.g2.bx.psu.edu/u/cjav/w/gatk . Any that are considered better?

written 12.6 years ago by Ryan D ★ 3.4k

2


I'm not sure how often spurious homozygous variants turn out to be artifacts -- but it does happen. Best to also filter your variants against the >5400 exomes available through the NHLBI's Exome Variant Server: http://evs.gs.washington.edu/EVS/

The public Galaxy page now includes a beta GATK install, and the nice folks at Galaxy are really good about helping design custom workflows to meet your needs.

written 12.6 years ago by Alex Paciorkowski 3.5k

0


Hi Alex, I did not see any tools available or in development that will let us query against the NHLBI Exome Variant Server. Can you point me to any workflows that include this, or suggest who at Galaxy would be the contact to get that implemented? It sounds like a great resource, but this is also the first time I've heard of it.

written 12.6 years ago by Ryan D ★ 3.4k

1


Hi Ryan - The ESP5400 SNP data can be downloaded from EVS via their "downloads" page at http://evs.gs.washington.edu/EVS/

You can then use local scripts to query that data. EVS data are not integrated into Galaxy afaik, but if you want to email me off-line I can put you in touch with people who are helping our group design custom workflows in Galaxy.
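A local script for this could look something like the sketch below. It assumes the downloaded data is available as VCF text (or has been converted to it) and relies only on the first five standard VCF columns; the function names and the example records are made up for illustration and are not real EVS entries.

```python
def load_known_variants(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF-formatted lines."""
    known = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one key per alternate allele
            known.add((chrom, int(pos), ref, alt))
    return known


def split_by_known(calls, known):
    """Partition calls into those present in the reference set and those absent."""
    seen = [c for c in calls if c in known]
    novel = [c for c in calls if c not in known]
    return seen, novel


# Made-up reference records standing in for a downloaded file.
reference_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t.\t.\t.",
    "2\t2000\t.\tC\tT,G\t.\t.\t.",
]
known = load_known_variants(reference_vcf)

# Made-up calls from our own pipeline.
calls = [("1", 1000, "A", "G"), ("2", 2000, "C", "G"), ("3", 3000, "T", "C")]
seen, novel = split_by_known(calls, known)
print("seen:", seen)
print("novel:", novel)
```

Exact matching like this is fragile: chromosome names must use the same convention in both files ("1" versus "chr1"), both must be on the same reference build, and indels may be represented differently, so some normalization is usually needed first. To read from disk, the same function accepts an open file handle in place of the list.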

written 12.6 years ago by Alex Paciorkowski 3.5k

score 3 · Answer 2 · 2012-04-09


12.6 years ago

Ron128 ▴ 30

I would think the computational part is just as important as the coverage. What pipeline do you plan to use? We faced the same issue while carrying out exome sequencing of cancer tumours, and we traced it to the pipeline used for variant calling, GATK in our case.

written 12.6 years ago by Ron128 ▴ 30

0


We will use BWA for alignment to produce BAM files for input to GATK v2, as outlined here: http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2

written 12.6 years ago by Ryan D ★ 3.4k

2


But GATK v3 is available! http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v3