I've been spending quite some time on the following problem: I sequenced a bacterial genome using paired-end reads (SOLiD) and I have a fairly good reference sequence. My goal is to detect changes in the sequenced sample compared to the reference sequence.
The detection of SNPs and small indels (a couple of bp long) is quite straightforward using the standard tools (SAMtools, GATK). However, I'm stuck on the task of detecting larger indels (tens to hundreds of bp). I tried several tools and settled on Pindel (on a recommendation from this forum).
Because I didn't know whether to trust Pindel's output, I started to simulate some data: I introduced indels of several sizes into the reference, mapped the original data to the modified reference, and checked whether Pindel was able to detect the changes. Pindel is very sensitive and detected most of those indels, but that sensitivity is also the main problem: I find it nearly impossible to distinguish true indels from false positives. The only measure of confidence it gives for a call is the raw number of supporting reads; there is no proper significance statistic.
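For reference, the simulation and scoring step described above can be sketched roughly as follows. This is a minimal Python illustration, not the script actually used: the function names (simulate_indels, score_calls), the call format (dicts with kind, size, pos and support), and the 10 bp matching tolerance are all assumptions made for the example, and parsing Pindel's output into that format is left out entirely.

```python
import random


def simulate_indels(reference, sizes, seed=0):
    """Introduce one indel per entry in `sizes` into `reference`.

    The reference is split into equal windows and one event is placed in
    each window, so events never overlap. Returns the edited sequence and
    a truth list describing the edits that were applied.
    """
    if not sizes:
        raise ValueError("need at least one indel size")
    rng = random.Random(seed)
    window = len(reference) // len(sizes)
    pieces, truth = [], []
    cursor = 0  # next original base that has not been copied yet
    shift = 0   # net length change introduced so far
    for i, size in enumerate(sizes):
        if size >= window:
            raise ValueError("indel size must be smaller than the window")
        pos = rng.randrange(i * window, (i + 1) * window - size)
        kind = rng.choice(["DEL", "INS"])
        pieces.append(reference[cursor:pos])
        if kind == "DEL":
            cursor = pos + size  # skip the deleted bases
        else:
            pieces.append("".join(rng.choice("ACGT") for _ in range(size)))
            cursor = pos
        truth.append({"kind": kind, "size": size,
                      "ref_pos": pos, "mut_pos": pos + shift})
        shift += size if kind == "INS" else -size
    pieces.append(reference[cursor:])
    return "".join(pieces), truth


def score_calls(truth, calls, min_support=1, tolerance=10, pos_key="ref_pos"):
    """Compare calls against the truth list.

    Assumption: calls use the same convention as the truth list (same
    coordinate system as `pos_key`, same meaning of DEL/INS). If the
    original reads are mapped to the edited reference, a deletion made in
    the reference shows up as an insertion in the sample, so kinds may
    need to be flipped before scoring.
    """
    kept = [c for c in calls if c["support"] >= min_support]
    matched = set()
    for call in kept:
        for idx, event in enumerate(truth):
            if idx in matched:
                continue
            if (call["kind"] == event["kind"]
                    and abs(call["pos"] - event[pos_key]) <= tolerance
                    and abs(call["size"] - event["size"]) <= tolerance):
                matched.add(idx)
                break
    tp = len(matched)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(truth) if truth else 0.0
    return {"tp": tp, "fp": len(kept) - tp, "fn": len(truth) - tp,
            "precision": precision, "recall": recall}
```

Running score_calls over a range of min_support values on the simulated data shows how precision and recall trade off against the supporting-read count, which at least gives an empirical cutoff where no significance statistic is available.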
My questions:
- What would someone with more experience in this kind of work recommend? Any other software tools? Another approach? Or will I have to accept that paired-end short reads are not well suited to this kind of question?
- For next time: what sequencing approach would you recommend? 454 reads with de novo assembly and subsequent comparison of the contigs to the reference? PacBio? Does Illumina offer a better approach?
Thanks for any help!