I have longer (150 bp) paired-end (PE) DNA-seq reads and would like to align them to a known bacterial genome and find sequence variation. The problem is that these longer reads overlap, so I am looking for a GUI-based tool that can convert the PE reads to single-end reads, which I can then align to the reference genome (~6K). I am also open to any other suggestions. Can I use Bowtie 2 or BWA-SW to align such longer reads? Thanks
Why is it a problem that the forward and reverse reads overlap? Is it an issue that the overlapping part has more representation than the outer ends of the reads?
Yes, that's the typical worry. If you assume there was (say) a substitution error early on in PCR amplification, that would be propagated to all copies of the molecule on the flowcell. If that error was in the overlapping portion, both reads would see it and subsequent algorithms would give (spuriously) extra weight to this error.
The shorter story is that sometimes it breaks the independence assumption of the error model that most tools follow.
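To make the double-counting concern concrete, here is a minimal sketch in Python. It is not how any particular caller works; the function name `allele_counts` and the `(fragment_id, base)` input format are made up for illustration. It simply contrasts counting every read as independent evidence with counting each sequenced fragment once.

```python
from collections import Counter

def allele_counts(observations, collapse_pairs):
    """Count bases seen at one site.

    observations: list of (fragment_id, base) tuples (hypothetical format).
    Both mates of a pair share the same fragment_id.
    """
    if not collapse_pairs:
        # Naive: every read counts, so an overlapping pair counts twice.
        return Counter(base for _, base in observations)

    # Group the bases by fragment so each molecule is one observation.
    per_fragment = {}
    for fragment_id, base in observations:
        per_fragment.setdefault(fragment_id, set()).add(base)

    # Count a fragment once, and only when its mates agree on the base.
    return Counter(
        next(iter(bases)) for bases in per_fragment.values() if len(bases) == 1
    )

# frag1 carries a PCR error (G) that both overlapping mates observe.
obs = [("frag1", "G"), ("frag1", "G"), ("frag2", "A"), ("frag3", "A")]
print(allele_counts(obs, collapse_pairs=False))  # G counted twice
print(allele_counts(obs, collapse_pairs=True))   # G counted once
```

In the naive count the erroneous G looks as well supported as the true A, even though it comes from a single molecule.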
Matted, you have summarized the issue correctly. Do you have any suggestions for solving this problem? Sorry for being a little impatient.
Thanks
BWA failed when I used it for aligning. Any other suggestions? It would be easiest if I could align to a reference without any manipulation of the data. However, it has been suggested that overlapping reads need to be merged first (http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html).
It seems like the link you supply lists a few competent tools for doing this. FLASH seems worth a shot. Alternatively, can you just use only the forward reads for a preliminary analysis?
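For a sense of what merging an overlapping pair involves, here is a toy sketch in Python. It is not FLASH's algorithm or that of any tool in the linked post; `merge_pair`, `min_overlap` and `max_mismatch_frac` are names and defaults invented for this example. It ignores base qualities and does not handle fragments shorter than the read length, both of which real mergers deal with.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with its mate; return None if no overlap is found.

    Thresholds are illustrative assumptions, not values from any real tool.
    """
    rev_rc = reverse_complement(rev)  # put the mate on the forward strand
    # Try the longest candidate overlap first, then shrink.
    for olen in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        tail = fwd[-olen:]      # end of the forward read
        head = rev_rc[:olen]    # start of the reoriented mate
        mismatches = sum(1 for a, b in zip(tail, head) if a != b)
        if mismatches <= max_mismatch_frac * olen:
            # Simplification: keep the forward read's bases in the overlap.
            return fwd + rev_rc[olen:]
    return None

# Toy 16 bp fragment sequenced with 10 bp reads, so the mates overlap by 4 bp.
fragment = "ACGTTGCAAGGCTTAC"
fwd = fragment[:10]
rev = reverse_complement(fragment[6:])
print(merge_pair(fwd, rev, min_overlap=4, max_mismatch_frac=0.0))
```

With real 150 bp reads you would keep a larger minimum overlap so that short chance matches are not mistaken for a true overlap.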
Overlapping ends have existed since the first day of Solexa sequencing. Aligning overlapping ends is never a problem. The problem is at the SNP calling stage, where you would want to collapse the overlap to reduce false SNPs caused by PCR errors.
Ih3 and Matt,
Just to follow up: I used BWA as well as Bowtie to see if I could align either the forward reads alone or both PE reads. Most of the alignments (99%) give a score of 77 or 141; none of the reads are mapped. I am aligning to a bacterial genome.
Do you mean "flag"? Flag 77/141 indicates the read is unmapped, which has nothing to do with overlapping ends. It is more likely that your reads are of poor quality or you are using the wrong reference. Why is your bacterial genome only 6 kb long?
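If it helps to see why 77 and 141 mean "unmapped", the flag is a bit field, and a few lines of Python can spell it out. The bit values below are the ones defined in the SAM specification; the short names in the table and the `decode_flag` helper are just labels chosen for this sketch.

```python
# SAM flag bits (values from the SAM specification; names abbreviated here).
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]

for flag in (77, 141):
    print(flag, decode_flag(flag))
# 77  -> paired, unmapped, mate_unmapped, first_in_pair
# 141 -> paired, unmapped, mate_unmapped, second_in_pair
```

So 77 and 141 are simply the first and second read of a pair in which neither mate aligned.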
I remember hearing or reading that GATK will handle this the right way automatically, but samtools will not. I'd be happy to learn if that's true or not...
What you said is true.