Hello BioStar,
After some time working with Illumina sequencers and pipelines, I've identified a bottleneck in getting early "draft" results from a run: finding out what the sequencing data looks like in real time, as opposed to waiting ~11 days for the run to finish.
The goal would be to get an estimate of how many reads one could expect for each sample, thereby guiding the setup of a subsequent run to top up the data for those samples that do not reach the required amounts. It would help a lot if we could reduce the wall-clock time for reaching a decision on which samples need to be re-run.
When it comes to implementation, I've been thinking of a file status daemon such as Guard[1], coupled with the CASAVA/OLB tools from Illumina, performing basecalling and demultiplexing as soon as the files get written to disk, without having to wait for the whole run to finish.
Other, more high-tech solutions would involve Hadoop Seal[2] and Flume[3] to do the streaming part into Hadoop... but before digging further into this issue, I wondered what you guys are doing to get a draft view of a running sequencing run.
Cheers & happy new year Bio* !
[1] https://github.com/guard/guard
[2] http://biodoop-seal.sourceforge.net/installation_generic.html
You mention demultiplexing. Remember that you need to have the index read for that step.
Are you asking about what to expect from the run, the sample, or both? If you are uncertain about the sample quality, you can do a MiSeq run to ensure that you have a good prep before doing a long HiSeq run. If your question is only about the data from the HiSeq run itself and not the sample quality, a MiSeq pilot won't help.
Yes, that was one of the ideas: detect in real time whether READ3 has finished and kick off the draft demultiplexing... makes sense?
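To make the idea concrete, here is a minimal polling sketch in Python of the "has READ3 finished?" check. It is only an illustration, not a tested pipeline: the marker file name `Basecalling_Netcopy_complete_Read{n}.txt` is my assumption about what the instrument software drops into the run folder (check your own run directory), and `read_finished`, `wait_for_read` and the `on_finished` callback are hypothetical names. The callback is where a draft demultiplexing job would be launched.

```python
import time
from pathlib import Path

# ASSUMPTION: the instrument software writes one marker file per finished read
# into the run folder. Adjust the template to whatever your setup produces.
DEFAULT_MARKER = "Basecalling_Netcopy_complete_Read{n}.txt"


def read_finished(run_dir, read_number, marker_template=DEFAULT_MARKER):
    """Return True if the marker file for the given read exists in run_dir."""
    marker = Path(run_dir) / marker_template.format(n=read_number)
    return marker.exists()


def wait_for_read(run_dir, read_number, on_finished,
                  poll_seconds=60.0, max_polls=None,
                  marker_template=DEFAULT_MARKER):
    """Poll run_dir until the read's marker appears, then call on_finished.

    on_finished(run_dir, read_number) is a hypothetical hook, e.g. the place
    to start a draft demultiplexing job. Returns True if the marker was seen,
    False if max_polls checks passed without seeing it.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if read_finished(run_dir, read_number, marker_template):
            on_finished(Path(run_dir), read_number)
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

Something like `wait_for_read("/path/to/run", 3, start_draft_demux)` would then block until the index read is on disk, with `start_draft_demux` being whatever wrapper you write around your demultiplexing tools. A file status daemon such as Guard would replace the polling loop with filesystem events, but the logic stays the same.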
Why not do a MiSeq run (if possible) to profile your sample/sample prep?
Casey, I'm not sure what you mean by that profiling, can you elaborate? Our core facility routinely performs several HiSeq 2000 runs per month, so we have (big) sample datasets to test against.
I was just wondering if someone had gone through all the trouble of generating a minimal dataset and testing a sort of "incremental sequencing run drafting" as described in the question... Was I clear enough in my explanation?
Aha, now I get what you meant, yes, the question is about the different samples and the run itself. In other words, we would like to see how different samples get reads incrementally by streaming the demultiplexing step, as the run goes.
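As a rough illustration of what I mean by per-sample counts, here is a small Python sketch. It assumes the index reads are already available as plain strings and that the sample sheet has been loaded into a dict from index sequence to sample name; both function names are hypothetical, matching is exact (no mismatches allowed), and the projection simply scales linearly, which only makes sense if the processed tiles are representative of the whole flowcell.

```python
from collections import Counter


def tally_reads_per_sample(index_reads, index_to_sample):
    """Count reads per sample by exact match of the index sequence.

    index_reads: iterable of index (barcode) sequences seen so far.
    index_to_sample: dict mapping index sequence -> sample name
                     (ASSUMPTION: built from your own sample sheet).
    Unknown indexes are collected under "Undetermined".
    """
    counts = Counter()
    for index_seq in index_reads:
        counts[index_to_sample.get(index_seq, "Undetermined")] += 1
    return counts


def project_final_counts(counts, fraction_seen):
    """Scale partial counts up to a full-run estimate.

    fraction_seen: share of the data processed so far, in (0, 1].
    ASSUMPTION: the processed part is representative of the rest.
    """
    if not 0 < fraction_seen <= 1:
        raise ValueError("fraction_seen must be in (0, 1]")
    return {sample: round(n / fraction_seen) for sample, n in counts.items()}
```

For example, tallying the index reads from half of the tiles and calling `project_final_counts(counts, 0.5)` would give a draft per-sample estimate to compare against the required amounts.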
Your suggestion of using a faster sequencer (MiSeq) for QC makes sense, but we're interested in avoiding extra lab work: we just want to use the running HiSeq experiment and extract its info as soon as it's available.
Thanks again for your feedback Casey !