Question

Faster Illumina Analysis Pipeline Via Streaming

7

13.4 years ago

Roman Valls Guimerà ▴ 620

Hello BioStar,

After some time working with Illumina and pipelines, I've identified a bottleneck when getting early "draft" results from a run: figuring out what the sequencing data looks like in real time, as opposed to waiting for the run to finish after ~11 days.

The goal would be to get an estimate of how many reads one could expect to get for each sample, thereby guiding the setup of a subsequent run to top up the data for those samples that do not reach the required amounts. It would help a lot if we could reduce the wall-clock time for reaching a decision on which samples need to be re-run.

When it comes to implementation, I've been thinking of a file status daemon such as Guard[1], coupled with the CASAVA/OLB tools from Illumina, performing basecalling and demultiplexing as soon as the files get written to disk, without having to wait for the whole run to finish.
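To make the idea concrete, here is a minimal polling sketch in Python (not Guard itself, which is a Ruby tool). It assumes the run folder is visible as an ordinary directory and that a file whose size and mtime are unchanged between two polls has been fully written. The names `snapshot`, `new_stable_files`, `watch` and the `handle` callback are hypothetical and only for illustration:

```python
import os
import time


def snapshot(run_dir):
    """Map every file path under run_dir to its (size, mtime)."""
    state = {}
    for root, _dirs, files in os.walk(run_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            state[path] = (st.st_size, st.st_mtime)
    return state


def new_stable_files(previous, current, seen):
    """Files not yet reported whose size and mtime match the previous snapshot."""
    ready = [path for path, sig in current.items()
             if path not in seen and previous.get(path) == sig]
    return sorted(ready)


def watch(run_dir, handle, interval=60, max_polls=None):
    """Poll run_dir and call handle(path) once per file that has stopped changing."""
    seen = set()
    previous = snapshot(run_dir)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(run_dir)
        for path in new_stable_files(previous, current, seen):
            handle(path)  # e.g. queue the file for draft basecalling
            seen.add(path)
        previous = current
        polls += 1
```

Something like `watch("/path/to/run_folder", print)` would just list files as they settle; the real work would go into the callback.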

Other, more high-tech solutions would involve Hadoop Seal[2] and Flume[3] to do the streaming part into Hadoop... but before digging further into this issue, I wondered what you guys are doing to get a draft view of a running sequencing run.

Cheers & happy new year Bio* !

[1] https://github.com/guard/guard

[2] http://biodoop-seal.sourceforge.net/installation_generic.html

[3] http://www.slideshare.net/cloudera/inside-flume

illumina analysis next-gen sequencing pipeline • 4.3k views

ADD COMMENT • link updated 13.4 years ago by Luca ▴ 10 • written 13.4 years ago by Roman Valls Guimerà ▴ 620

2

You mention demultiplexing. Remember that you need to have the index read for that step.

ADD REPLY • link 13.4 years ago by Sean Davis 27k

1

Are you asking about what to expect from the run, the sample or both? If you have uncertainty about the sample quality, you can do a MiSeq run to ensure that you have a good prep before doing a long HiSeq run. If your question is only about the data from the HiSeq run itself and not the sample quality, a MiSeq pilot won't help.

ADD REPLY • link 13.4 years ago by Casey Bergman 18k

0

Yes, that was one of the ideas: detect in real time whether READ3 has finished and kick off the draft demultiplexing... makes sense?
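A rough sketch of that trigger in Python, assuming the instrument software drops some kind of per-read completion marker file into the run folder once the index read is done. The marker filename and the demultiplexing command are deliberately left as parameters, since both depend on the local setup; `wait_for_marker` and `draft_demux_when_ready` are hypothetical helper names:

```python
import os
import subprocess
import time


def wait_for_marker(run_dir, marker_name, interval=300, timeout=None):
    """Block until run_dir/marker_name exists; return False if timeout (seconds) expires."""
    marker = os.path.join(run_dir, marker_name)
    deadline = None if timeout is None else time.monotonic() + timeout
    while not os.path.exists(marker):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


def draft_demux_when_ready(run_dir, marker_name, command, **wait_kwargs):
    """Run command (an argument list) once the marker appears; return its exit code."""
    if not wait_for_marker(run_dir, marker_name, **wait_kwargs):
        return None  # marker never showed up within the timeout
    # command is whatever draft demultiplexing step is used locally (assumption)
    return subprocess.run(command, check=True).returncode
```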

ADD REPLY • link 13.4 years ago by Roman Valls Guimerà ▴ 620

0

Why not do a MiSeq run (if possible) to profile your sample/sample prep?

ADD REPLY • link 13.4 years ago by Casey Bergman 18k

0

Casey, I'm not sure what you mean by that profiling, can you elaborate? Our core facility routinely performs several HiSeq 2000 runs per month, so we have (big) sample datasets to test against.

I was just wondering if someone had gone through all the trouble to generate a minimal dataset and test a sort of "incremental sequencing run drafting" as described in the question... Was I clear enough in my explanation?

ADD REPLY • link 13.4 years ago by Roman Valls Guimerà ▴ 620

0

Aha, now I get what you meant, yes, the question is about the different samples and the run itself. In other words, we would like to see how different samples get reads incrementally by streaming the demultiplexing step, as the run goes.

Your suggestion of using a faster sequencer (MiSeq) for QC makes sense, but we're interested in not doing extra lab work, just using the running HiSeq experiment and extracting its info as soon as it's available.

Thanks again for your feedback Casey !

ADD REPLY • link 13.4 years ago by Roman Valls Guimerà ▴ 620

score 1 · Accepted Answer · 2012-01-02

We're waiting for the whole run to finish before getting any QC information. Optimizing that step like you say would surely be beneficial, and if you can get the base calling on the index read to run as soon as it's been sequenced, then I think the rest would be simple.

Since you only want to get an estimate of the read counts for each of the multiplexed samples, I would try simply using a script to calculate the index read frequencies. You could even exclude reads that have quality problems from your counts. I don't think performance would be an issue since you'd only be dealing with the index reads for one run but, if necessary, you could quickly copy the index reads to a Hadoop cluster and run a custom Pydoop script (or even recycle the Hadoop wordcount example) to generate a table (index read, count), which you could join to your sample sheet to get your sample read counts.
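For the non-Hadoop route, a minimal sketch of such a script in plain Python could look like the one below. It rests on a few assumptions: the index reads have already been exported to FASTQ, qualities are Phred+33 encoded (pass `phred_offset=64` for older pipelines), indexes are matched exactly with no mismatches allowed, and the sample sheet is a simple CSV with hypothetical `sample_id` and `index` columns rather than the real CASAVA layout:

```python
import csv
from collections import Counter


def read_fastq(handle):
    """Yield (sequence, quality) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def count_indexes(handle, min_qual=20, phred_offset=33):
    """Count index sequences, skipping reads with any base below min_qual."""
    counts = Counter()
    for seq, qual in read_fastq(handle):
        if qual and min(ord(c) - phred_offset for c in qual) >= min_qual:
            counts[seq] += 1
    return counts


def per_sample_counts(index_counts, sample_sheet_handle):
    """Join index counts to a CSV sample sheet with 'sample_id' and 'index' columns."""
    totals = Counter()
    known = set()
    for row in csv.DictReader(sample_sheet_handle):
        known.add(row["index"])
        totals[row["sample_id"]] += index_counts.get(row["index"], 0)
    # Reads whose index matches no sample in the sheet
    totals["undetermined"] = sum(
        n for idx, n in index_counts.items() if idx not in known)
    return dict(totals)
```

Opening the index FASTQ and the sample sheet as text files and passing the handles to `count_indexes` and `per_sample_counts` gives the (sample, count) table directly.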

HTH.