HiSeq 2500 chemistry enhancements empower the industry’s highest daily throughput and drive down the price of whole-genome sequencing. With support of paired 250 base pair read lengths in rapid run mode, the HiSeq 2500 will be capable of generating up to 300 Gb in rapid mode with sample to data in less than three days. These enhancements will be available in the second half of this year.
What will these much longer read lengths enable in terms of improved analysis?
I think longer reads in general, as long as there isn't an associated increase in error rates, are ultimately good. Even with just the human genome there are still many regions that are hard to map reliably due to repetitive elements and low sequence complexity. Longer reads can help with those areas (although the reference itself is of course still problematic there). And any overlap between pairs should, theoretically, help with error correction to an extent.
Many repeats are still mappable given paired-end reads. With the standard 2*100bp, about 94-95% of the human genome is callable. 2*250bp may not give you a big improvement. De novo assembly will greatly benefit from longer reads. Although in theory we can also use PE reads to assemble through Alu, in practice few (if any) assemblers are really working this out. Overlapping PE reads from ~400bp fragments are much preferred for de novo assembly.
I agree that things will only be slight improvements for what I listed. I would argue that there is still a decent portion of that callable percentage that can still be problematic. We see this routinely even working with targeted Exome data. 250 bp reads probably won't improve much, but they may improve it slightly in some of these regions. I see the longer reads being of far more use for RNA-Seq and de novo assembly though.
As long as the 3' base quality stays high enough to use close to 250bp, it seems you would have less ambiguous placement of split reads for RNA-seq. It also seems like you could select for ~300bp fragment sizes in your library and develop methods to detect base miscalls vs. PCR errors using even the lower quality 3' overlapping sequence. I'm not sure if in-read indel detection is restricted by discovery (number of reads containing mappable sequence flanking an indel) or whether it is computationally prohibitive to consider the gapped alignments.
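To make the overlap idea concrete, here is a minimal Python sketch, not a tested pipeline. It assumes the fragment length is already known, that there are no indels inside the overlap, and that read 2 is reported on the opposite strand. The function names are made up for illustration. Both reads come from the same template molecule, so an error that was already in the template (for example from PCR) should show up in both reads, while a position where the two reads disagree is more likely a base miscall in one of them.

```python
# Translation table for complementing bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_disagreements(read1, read2, fragment_len):
    """List fragment coordinates where the two reads of a pair disagree.

    Assumption (hypothetical helper): fragment_len is known and the reads
    contain no indels, so the overlap can be compared position by position.
    """
    # Put read 2 into the same orientation as read 1.
    mate = reverse_complement(read2)
    # Fragment coordinate at which the reoriented mate starts.
    mate_start = fragment_len - len(read2)
    start = max(0, mate_start)
    end = min(len(read1), fragment_len)
    return [pos for pos in range(start, end)
            if read1[pos] != mate[pos - mate_start]]


# Toy example: a 20 bp fragment read with 15 bp reads (10 bp overlap).
fragment = "ACGTTGCAAGGCTTAACCGG"
read1 = fragment[:15]
read2 = reverse_complement(fragment[5:])
# Simulate a miscall in read 1 at fragment position 10.
read1_miscall = read1[:10] + ("A" if read1[10] != "A" else "C") + read1[11:]
print(overlap_disagreements(read1_miscall, read2, len(fragment)))
```

With real data you would also want to look at the base qualities at each disagreeing position before deciding which read to trust.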
It just occurred to me (and perhaps this is common knowledge) that when detecting structural rearrangements, longer read lengths can have detrimental effects.
Say you had a 100 bp translocation. You could easily identify that with 50 bp paired-end reads showing an unexpected insert size, since the reads would still map inside the region. But 250 bp reads would cover the entire region, so none of them would map anymore, and that would leave a hole at both locations. That is less information than before.
I think what this really means is that, instead of mate-pair mapping used to detect translocations, we would need to move toward a split read approach. For DNA sequencing all of the unmapped reads could potentially span translocation junctions and could be mapped using multiple random seeds from each read. If you have seeds that map and can be expanded such that two distantly mapped seeds uniquely expand to encompass two halves of the entire read, then you can still directly detect the translocation.
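A toy Python sketch of the seed-and-extend idea described above, under some loud assumptions: exact string matching stands in for a real aligner, seeds are taken at a fixed stride rather than at random positions (to keep the example reproducible), and only seeds with a single hit in the reference are used. All function names and parameters here are hypothetical.

```python
import random


def find_all(reference, seed):
    """Return every start position of an exact seed match in the reference."""
    positions = []
    start = reference.find(seed)
    while start != -1:
        positions.append(start)
        start = reference.find(seed, start + 1)
    return positions


def extend_seed(reference, read, ref_pos, read_pos, seed_len):
    """Extend an exact seed hit in both directions without mismatches.

    Returns (read_start, read_end, ref_start) of the extended segment.
    """
    left = 0
    while (read_pos - left > 0 and ref_pos - left > 0
           and read[read_pos - left - 1] == reference[ref_pos - left - 1]):
        left += 1
    right = seed_len
    while (read_pos + right < len(read) and ref_pos + right < len(reference)
           and read[read_pos + right] == reference[ref_pos + right]):
        right += 1
    return (read_pos - left, read_pos + right, ref_pos - left)


def detect_split_read(reference, read, seed_len=16, stride=8, min_distance=100):
    """Return two segments that together cover the read but map far apart.

    Assumption: seeds at a fixed stride replace the random seeds suggested
    in the thread, and min_distance is an arbitrary illustrative cutoff.
    """
    segments = set()
    for read_pos in range(0, len(read) - seed_len + 1, stride):
        hits = find_all(reference, read[read_pos:read_pos + seed_len])
        if len(hits) != 1:
            continue  # skip unmapped or ambiguous seeds
        segments.add(extend_seed(reference, read, hits[0], read_pos, seed_len))
    for a in segments:
        for b in segments:
            covers_read = a[0] == 0 and b[1] == len(read) and a[1] >= b[0]
            # Compare alignment diagonals (reference start minus read start).
            distant = abs((a[2] - a[0]) - (b[2] - b[0])) >= min_distance
            if covers_read and distant:
                return a, b
    return None


# Toy data: a random 400 bp reference and a read joining two distant pieces.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(400))
junction_read = reference[50:90] + reference[300:340]
print(detect_split_read(reference, junction_read))
```

The exact breakpoint can be ambiguous by a few bases when the sequence on either side of the junction happens to match, which is why the two segments may overlap slightly within the read.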
Or perhaps pursue assembly instead; once the reads are sufficiently long this becomes more reasonable.
Further musings: since the insert sizes don't seem to be growing much (if at all), we may be evolving towards a situation where the reads run into one another and start to overlap, perhaps even fully. That's another characteristic that could be ripe for new techniques.
@matt shirley: +1 for your comment on split-read analysis methods. Even with 100bp reads, split-read approaches are practical; 250 bp reads most likely make them necessary.
Yes. Same thing for the 18S rRNA. The often targeted V4 region is about 400 bp and should be covered by 250 bp pairs. But I am worried about the taxa with longer V4 regions: they might disappear from our radars if we switch from 454 to Illumina.
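The arithmetic behind that worry can be sketched quickly in Python. This assumes both reads are the full 250 bp and usable along their whole length, and the 20 bp minimum overlap for joining a pair is an arbitrary illustrative value, not a recommendation from any particular tool.

```python
def pair_overlap(amplicon_len, read_len=250):
    """Bases covered by both reads of a pair sequenced from each amplicon end."""
    return max(0, min(amplicon_len, 2 * read_len - amplicon_len))


def is_mergeable(amplicon_len, read_len=250, min_overlap=20):
    """True if the pair overlaps enough to be joined into one sequence.

    min_overlap is a hypothetical threshold chosen for illustration.
    """
    return pair_overlap(amplicon_len, read_len) >= min_overlap


for length in (400, 450, 490, 520):
    print(length, pair_overlap(length), is_mergeable(length))
```

So a 400 bp amplicon leaves 100 bp of overlap, but the margin shrinks by one base for every extra base of amplicon, and trimming low-quality 3' ends would shrink it further.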