Introduction
In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into many shorter pieces. The resulting set of DNA is called a "library" and the short pieces are called "fragments". Each fragment in the library is then sequenced individually, with all of them processed in parallel. There are two ways of sequencing a fragment: from just one end, or from both ends. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.
Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F), otherwise the sequencer will run out of DNA to read! Typical Illumina fragment libraries use F ~ 450bp, but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads, which means F needs to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine: there is still a safety margin in the middle.
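The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the F versus 2L rule; the function name pair_layout and the labels it returns are made up for this example and do not come from any of the tools discussed here.

```python
def pair_layout(read_len, fragment_len):
    """Describe how two reads of length read_len sit on one fragment.

    Returns (label, bases), where bases is the size of the overlap,
    the size of the unsequenced gap, or the fragment length itself.
    """
    if fragment_len < read_len:
        # Even a single read runs off the end of the fragment.
        return "fragment shorter than one read", fragment_len
    overlap = 2 * read_len - fragment_len
    if overlap > 0:
        # F < 2L: the two reads share some bases in the middle.
        return "reads overlap", overlap
    # F >= 2L: some bases in the middle are never sequenced.
    return "gap between reads", -overlap


for L, F in [(100, 450), (150, 450), (250, 450), (250, 200)]:
    label, bases = pair_layout(L, F)
    print(f"L={L} F={F}: {label} ({bases} bp)")
```

With F = 450bp, reads of 100bp or 150bp leave a gap of 250bp or 150bp in the middle, while 250bp reads overlap by 50bp.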
However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired-end reads will overlap. Secondly, the new enzymatic Nextera library preparation system produces a wide spread of F sizes compared to the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.
The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F is a bit like a single read of length F, except that a bunch of bases in the middle are unknown, and how many of them there are is only roughly known (libraries are only nominally of length F, and each fragment will vary). This gives the reads a longer context, which particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools get confused if you give them overlapping pairs. If we can merge the overlapping pairs into longer single-end reads, many tools will produce better results, and faster.
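To make the idea of "overlapping" a pair concrete, here is a toy Python sketch. It is not how FLASH or any other tool works internally; the function names, the min_overlap and max_mismatch_frac parameters, and the example fragment are all assumptions for illustration. It reverse-complements the second read, looks for the longest suffix of read 1 that matches a prefix of the mate, and joins the two. It ignores quality scores and does not handle fragments shorter than one read.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair into one sequence, or return None if no overlap is found."""
    mate = reverse_complement(read2)  # put read 2 on the same strand as read 1
    # Try the longest candidate overlap first, then shorter ones.
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-olen:], mate[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read 1 in full and append the part of the mate beyond the overlap.
            return read1 + mate[olen:]
    return None


# Hypothetical 40 bp fragment sequenced as 2 x 25 bp, so the reads overlap by 10 bp.
fragment = "ACGTTGCATGCCGATAGCTTAGGCTAACGTCCGATTGACA"
read1 = fragment[:25]
read2 = reverse_complement(fragment[-25:])
print(merge_pair(read1, read2) == fragment)
```

A real tool also has to weigh base qualities, cope with adaptor read-through, and avoid false overlaps in repetitive sequence, which is why the dedicated programs listed in this post are worth using instead.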
The tools
Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!
PEAR (Paired-End Read Merger) http://sco.h-its.org/exelixis/web/software/pear/doc.html
COPE (Connecting Overlapping Paired End reads) http://sourceforge.net/projects/coperead/
SeqPrep https://github.com/jstjohn/SeqPrep
FLASH (Fast Length Adjustment of Short Reads to Improve Genome Assemblies) http://www.cbcb.umd.edu/software/flash
fastq-join (part of ea-utils) http://code.google.com/p/ea-utils/wiki/FastqJoin
PANDAseq https://github.com/neufeld/pandaseq
stitch (now defunct, merged into PANDAseq) https://github.com/audy/stitch
mergePairs.py http://code.google.com/p/standardized-velvet-assembly-report/source/browse/trunk/mergePairs.py
Features to look for
Keeps original IDs in merged reads
Outputs the un-overlapped paired reads
Ability to strip adaptors first
Rescores the Phred qualities across the overlapped region
Parameters to control the overlap sensitivity
Handle .gz and .bz2 compressed files
Multi-threading support
Written in C/C++ (faster compiled) rather than Python/Perl (slower)
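As an illustration of what "rescoring" the overlapped region can mean, here is a minimal Python sketch of one simple heuristic. It is an assumption for illustration only, not the method used by any of the tools listed above: where the two reads agree, the qualities are added (up to a cap), and where they disagree, the higher-quality base wins and its quality is reduced by the other one. The function names and the max_qual cap of 40 are made up; the only fixed convention used is the Phred+33 encoding of FASTQ quality strings.

```python
PHRED_OFFSET = 33  # Phred+33 FASTQ quality encoding


def decode_quals(qual_string):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(c) - PHRED_OFFSET for c in qual_string]


def encode_quals(quals):
    """Convert integer Phred scores back into a FASTQ quality string."""
    return "".join(chr(q + PHRED_OFFSET) for q in quals)


def consensus_base(b1, q1, b2, q2, max_qual=40):
    """Pick one base and quality for a position covered by both reads."""
    if b1 == b2:
        return b1, min(q1 + q2, max_qual)  # agreement: more confident, capped
    if q1 >= q2:
        return b1, q1 - q2  # disagreement: trust the better base, less confident
    return b2, q2 - q1


def merge_overlap(seq1, qual1, seq2, qual2):
    """Merge two same-length, same-strand views of the overlapped region."""
    bases, quals = [], []
    for b1, q1, b2, q2 in zip(seq1, decode_quals(qual1), seq2, decode_quals(qual2)):
        base, qual = consensus_base(b1, q1, b2, q2)
        bases.append(base)
        quals.append(qual)
    return "".join(bases), encode_quals(quals)


# Read 1 has Q40 everywhere ("I"), read 2 has Q20 ("5") and disagrees at position 3.
print(merge_overlap("ACGT", "IIII", "ACCT", "5555"))  # ('ACGT', 'II5I')
```

The disagreeing position keeps the base from the higher-quality read but drops to Q20, while the agreeing positions stay at the Q40 cap.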
This was originally posted by Torsten Seemann to his blog The Genome Factory. This was back in 2012, so some of the recommendations may be out of date. Two more recent tools worth looking at are leeHom and AdapterRemoval v2.
Thanks Genomax. I have used the bbmap toolkit a lot, and I agree it's really great. But bbmerge does not use a reference to merge the reads, right? My concern is that some reads will have small overlaps, and thus the overall fidelity of this approach will not be as high as it could be with a reference. But I will give it a try!
bbmerge has plenty of options. You can play with them and see if you are able to get the kind of overlaps you are looking for.
This is a nice overview :) - You should also take a look at BBMap, which is usually very good at these sorts of read manipulation things.
Specifically bbmerge.sh from BBMap. @Brian has an extended post available here.
Is there a tool to combine overlapping PE read that uses reference alignment for merging? I only found aftermerge but had no success due to CIGAR string problems (Merging overlapping mates in a BAM / SAM file into one read.)
This is not a commonly used analysis method that is why there is a dearth of tools. You should try bbmerge out on your original data. You may be pleasantly surprised.
Take a look to this article: "Optimizing Information in Next-Generation-Sequencing (NGS) Reads for Improving De Novo Genome Assembly"
Liu T, Tsai C-H, Lee W-B, Chiang J-H (2013) Optimizing Information in Next-Generation-Sequencing (NGS) Reads for Improving De Novo Genome Assembly. PLoS ONE 8(7): e69503. doi:10.1371/journal.pone.0069503
The authors present the ARF-PE tool, which looks amazing but seems to be discontinued.
NGmerge (2018) is another option. According to the paper, it performs better than other popular tools like FLASH and PEAR, particularly with respect to the estimation of quality scores for consensus bases.