Proper split alignment needs some special treatment. Most traditional aligners such as blast, blat and ssaha2 will give you contained hits you do not care about, while most NGS aligners assume a full-length match and will miss fragments. A proper split aligner should suppress contained hits while retaining the maximal non-overlapping fragments. To my limited knowledge, BWA-SW is the first that attempts to produce proper split alignments for sequence reads.
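To make "contained hits" concrete, here is a minimal sketch of the filtering idea. It is only an illustration, not how BWA-SW or any other aligner actually implements it. The function name `suppress_contained` and the hit representation (start, end, score on the read, half-open coordinates) are assumptions made for this example. It only drops hits that lie entirely inside another hit on the read; it does not resolve partial overlaps or use the scores.

```python
def suppress_contained(hits):
    """Drop hits whose interval on the read lies inside another hit's interval.

    hits: list of (q_start, q_end, score) tuples in read coordinates,
          half-open, i.e. the hit covers bases q_start .. q_end - 1.
    This hit layout is hypothetical, chosen only for illustration.
    """
    # Sort by start ascending, then end descending, so that any hit that
    # contains another one is always visited before the hit it contains.
    ordered = sorted(hits, key=lambda h: (h[0], -h[1]))
    kept = []
    for hit in ordered:
        contained = any(k[0] <= hit[0] and hit[1] <= k[1] for k in kept)
        if not contained:
            kept.append(hit)
    return kept


# A 100bp read with two real fragments and two shorter hits inside them.
example_hits = [(0, 50, 45), (10, 40, 28), (50, 100, 48), (60, 90, 25)]
fragments = suppress_contained(example_hits)
print(fragments)
```

With this input, the two shorter hits are dropped and the two 50bp fragments, which do not overlap on the read, are kept.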
I think you may try BWA-SW to see how it works. For 100bp reads, it may not be very sensitive to 50bp fragments, but for SVs, you do not need ultra-high sensitivity. Analyzing the BWA-SW output is fairly straightforward: if the alignment is split, bwa-sw will give two or more SAM lines for the read. The confidence of the split alignment is measured by the minimal mapping quality among the fragments of the same read. I have not tried yaha, but from its paper, it seems to be a suitable tool, too. Its developers have put more effort into SV discovery (while I have not). It is worth trying.
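The analysis described above can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not a tested pipeline: it assumes single-end reads (with paired-end data, the two mates share a read name and would need to be separated first), it relies only on the standard SAM columns (read name in column 1, flag in column 2, mapping quality in column 5), and the function name `split_alignment_confidence` and the example records are made up for illustration.

```python
from collections import defaultdict


def split_alignment_confidence(sam_lines):
    """Return {read_name: (n_fragments, min_mapq)} for reads with split alignments.

    sam_lines: iterable of SAM text lines (header lines are skipped).
    A read is treated as split if it has two or more mapped SAM lines.
    Assumes single-end reads; this helper is hypothetical.
    """
    mapqs = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and the SAM header
        fields = line.rstrip("\n").split("\t")
        qname, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        if flag & 0x4:
            continue  # 0x4 means the read is unmapped
        mapqs[qname].append(mapq)
    # Confidence of a split alignment = minimal MAPQ over its fragments.
    return {q: (len(m), min(m)) for q, m in mapqs.items() if len(m) >= 2}


# Made-up records: read1 is split across two chromosomes, read2 is not split,
# read3 is unmapped.
example_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t50M50S\t*\t0\t0\t*\t*",
    "read1\t0\tchr2\t500\t23\t50S50M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t300\t60\t100M\t*\t0\t0\t*\t*",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(split_alignment_confidence(example_sam))
```

Here only read1 is reported, with two fragments and a confidence of 23, the lower of its two mapping qualities. For a real file you could pass an open file handle, e.g. `split_alignment_confidence(open("aln.sam"))`, where `aln.sam` is a placeholder name.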
As for other mappers, BWA, Bowtie1, Soap2, Gsnap, GEM and YOABS assume a full-length match. They won't do split alignment. Bowtie2 is not designed with split alignment in mind, but for 100bp reads, it may work in simple scenarios (e.g. both fragments are fairly unique; remember to use --local). PRISM and Pindel require an anchor mate. They won't find alignments across chromosomes. I heard that mosaik can do proper split alignment, but I do not know the details. Smalt might work for split alignment, but I am not sure, either.
EDIT: BTW, you may also try the new algorithm, called mem, on the bwa mem branch. In theory, it should be more sensitive than bwa-sw for your data, but that branch is only a week old and is unstable.
If I understand correctly, short-read aligners like bwa or bowtie are based on a seed-align mode? That is, the first 33 or 28bp are used as a seed to search the reference database, and the remaining bases are used for calculating mismatches etc.? Please correct me if I am wrong. Thanks.
Not true for bwa. BWA restricts the number of mismatches in the seed, but it still requires a full-length match. That is different from the standard seed-and-extend. I don't know much about bowtie. From its paper and documentation, it seems to extend while allowing arbitrary mismatches in the rest of the read. If that were true, bowtie should be very sensitive for 100bp reads, but in practice it seems not to be.
thanks for the details Heng