I am working with sequencing data from a transposon mutagenesis library. I want to identify the exact site of transposon insertion into a plasmid (i.e. a 10 kb genome). I have many millions of 100 bp single-end reads from the plasmid. My code aligns each read to both the plasmid and the transposon sequence, then performs some DNA math on reads matching both sequences to work out the insertion position.
I have tried a number of different aligners (BLAT, bowtie2, subalign), and none of them identifies 100% of the true positive insertions in a synthetically generated library of reads with 20 bp of homology to both the plasmid and the transposon. The best I can get is about 92% recall. Is there an aligner designed for this task, i.e. identifying perfect or near-perfect 15-20 bp matches within short reads? Importantly, the software must be able to find matches that begin in the middle of a read.
Thanks for your help!
Recall of 92% sounds pretty good, to be honest.
I would expect any tool that is actually meant for this kind of task to map a synthetic dataset with 20 bp of overlap at the insertion site perfectly. Even with real data (with sequencing errors and variable degrees of overlap), an 8% rate of unmapped reads seems like a lot.
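To illustrate why perfect recall seems achievable on error-free synthetic reads, here is a minimal sketch that skips the aligner and uses plain exact substring search. It is not any particular tool's method. The function names, the `tn_end` parameter (the first 20 bp of the transposon) and the assumed read layout (plasmid flank immediately followed by the transposon end, on either strand) are all hypothetical choices for the example. It also assumes the plasmid is circular and that the 20 bp flank occurs only once in it.

```python
def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_insertion(read, plasmid, tn_end, k=20):
    """Return the 0-based number of plasmid bases preceding the insertion,
    or None if no exact junction is found.

    Assumed layout (hypothetical): ...plasmid flank | tn_end...
    Only exact matches are considered, so sequencing errors within the
    k-base flank or within tn_end will cause a miss.
    """
    # Extend the plasmid so flanks spanning the origin are still found.
    circular = plasmid + plasmid[:k - 1]
    for r in (read, revcomp(read)):
        i = r.find(tn_end)          # junction may start anywhere in the read
        if i < k:                   # not found (-1) or flank shorter than k
            continue
        flank = r[i - k:i]          # k plasmid bases next to the junction
        j = circular.find(flank)    # first exact hit; assumes a unique flank
        if j != -1:
            return (j + k) % len(plasmid)
    return None
```

On synthetic reads with at least 20 bp on each side of the junction, a search like this should recover every junction by construction. That makes it a useful baseline for judging whether an aligner's missed reads come from its seeding or scoring heuristics rather than from the data.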
I guess the definition of "recall" isn't clear. I was assuming that "recall" meant the fraction of reads containing a synthetic "insertion" that the aligner places correctly. I agree that a mapping rate of 92% for synthetic data would be unexpected. A clarification of "recall" might be useful, but my comment was more intuitive than rigorous, so it may not be worth following up on.
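To make the ambiguity concrete, here is a small sketch of two ways "recall" could be computed for a synthetic library. Both definitions and the function names are assumptions for illustration, not something stated in the question. Read-level recall counts reads whose insertion site was called correctly. Site-level recall counts true insertion sites recovered by at least one read.

```python
def read_level_recall(true_sites, called_sites):
    """Fraction of reads whose called site equals the true site.

    true_sites[i] is the simulated insertion site of read i;
    called_sites[i] is the site the pipeline reported, or None.
    """
    hits = sum(1 for t, c in zip(true_sites, called_sites) if c == t)
    return hits / len(true_sites)


def site_level_recall(true_sites, called_sites):
    """Fraction of distinct true sites recovered by at least one read."""
    truth = set(true_sites)
    found = {c for t, c in zip(true_sites, called_sites) if c == t}
    return len(found) / len(truth)


# Toy data: three insertion sites, six reads, three reads missed.
true_sites = [100, 100, 100, 250, 250, 900]
called_sites = [100, None, 100, 250, None, None]
print(read_level_recall(true_sites, called_sites))  # 0.5
print(site_level_recall(true_sites, called_sites))  # 2 of 3 sites found
```

With deep coverage the two numbers can differ a lot: a site is still recovered when most of its reads are missed, so 92% would mean something quite different depending on which definition is used.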