Question

Best local aligner for finding fusion events?

1

Entering edit mode

9.8 years ago

flamholz ▴ 10

I am working with sequencing data from a transposon mutagenesis library. I want to identify the exact site of transposon insertion into a plasmid (i.e. 10 kb genome). I have many millions of 100 bp unpaired reads off the plasmid. I have code that aligns to the plasmid and the transposon sequence and performs some DNA math on reads matching both sequences.

I have tried a number of different aligners - BLAT, bowtie2, subalign - and all of them fail to identify 100% of true positive insertions in a synthetically generated library of reads with 20 bp of homology both to plasmid and transposon. The best I can get is about 92% recall. Is there an aligner out there designed for this task? Identification of perfect or near perfect 15-20 bp matches within short reads? It is important that the software be able to find matches that begin in the middle of the read, of course.

Thanks for your help!

transposon alignment • 2.2k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by flamholz ▴ 10

1

Entering edit mode

Recall of 92% sounds pretty good, to be honest.

ADD REPLY • link 9.8 years ago by Sean Davis 27k

0

Entering edit mode

I would expect any tool which is actually meant for this kind of task to perfectly map a synthetic dataset with 20bp overlap at the insertion site. Even once you move into real data (with sequencing errors & variable degrees of overlap) an 8% rate of unmapped reads seems like a lot.

ADD REPLY • link 9.8 years ago by alex.rubinsteyn ▴ 190

0

Entering edit mode

I guess the definition of "recall" isn't clear. I was assuming that the "recall" was for an aligner to correctly align a read with a synthetic "insertion" in it. I agree that a mapping rate of 92% for synthetic data would be unexpected. Perhaps a clarification of "recall" would be useful, but my comment wasn't a very academic one, more of an intuitive comment, so it may not be important to follow up on it.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Sean Davis 27k

Ram · Answer 1 · 2015-04-27

2

Entering edit mode

9.8 years ago

Brian Bushnell 20k

I suggest using BBDuk for this purpose; it's designed for exact and near-exact kmer matches. For the transposon:

bbduk.sh in=reads.fq outm=matched.fq k=20 hdist=1 mm=f literal=ACGTACGTACGTACGTACGT

For the plasmid, use ref=plasmid.fa instead of the literal flag.

BBDuk can also trim the reads based on where the transposon is, so you can then then map the remainder to the plasmid, or something like that.

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Brian Bushnell 20k

0

Entering edit mode

Brian -- this looks great so far. Can I get BBDuk to output the boundaries of the match? Or does it only output the reads that match and not the position of the match within the read?

ADD REPLY • link 9.8 years ago by flamholz ▴ 10

0

Entering edit mode

In a single pass, it will only filter the reads or trim them or mask them. But, after first filtering, you can get the boundaries of the match like this:

bbduk.sh in=reads.fq out=matched.fq kmask=lc k=20 hdist=1 mm=f literal=...

That will convert the matching bases to lowercase, rather than doing a filter operation. You can alternately convert them to N with kmask=N, or another symbol that is easy to parse (like kmask=Z).

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Brian Bushnell 20k