I'm having some trouble with repetitively mapping Illumina data. I'd like to remove any reads that have a reasonably close second-best hit in the genome/transcriptome, so if the best hit has an edit distance of n, I'd like to remove it if the second-best hit is within n+x (say n < 5, x < 2).
For historical/collaboration reasons, we've been using bwa, which I know is designed to find second-best hits but not necessarily all hits up to a given edit distance. I believe that the bwa aln -N
option is supposed to find all (more or less) hits up to a given edit distance (without the -N
option, it appears to only find second-best hits with up to a given number of mismatches, not including indels). I'm getting garbage for the hits in the XA tag, eg:
readID1 16 chr5 67488388 20 101M * 0 0 ACAGAGGAACCTTCTTAGTGCTCACCGACCTTATACCCAGCTATGGGGACCAGCTCCAGAAAGCACCTTTGCTCCCAATTGGTTAGCTATAGTAAGGTGCG DDDDDDDC?DCDDDDDDDDDDBDDDDFHHHHHHG@JJJJJJIIHJJJJIJJJJJIJJJJJJJIHGJJJJJIIGJJJJJJJJJIJJJJIHHHHHFFFFF@<C XT:A:U NM:i:1 X0:i:1 X1:i:2 XM:i:1 XO:i:0 XG:i:0 MD:Z:43G57 XA:Z:chr5,-67488388,101M,6;chr5,-67488390,2S99M,6;
Where you'll note that the start coordinates are identical or very similar for all three hits. For reference, this is how I'm calling bwa:
bwa aln -b -O2 -n 6 -t 12 -N ref.fa input.bam | bwa samse ref.fa - input.bam | samtools view -h -b -S -o remapped_N.bam -
I've also tried bowtie2 (very slow) and masai (very memory intensive) for mapping. A few questions:
Am I doing something wrong in how I'm calling bwa?
Does anyone have a better mapper for my particular situation? (I'm actually mapping transcriptome data against a genome including an artificial transcriptome as the rna-seq specific mappers don't seem to do well with finding possible second-best hits, but I'm open to suggestions.)
PS I've looked at this post: For short reads, which aligners find all hits given certain edit distance threshold? but I don't think any of the output was checked for accuracy nor speed.
Have you tried mapping quality cutoffs to achieve what you want? It's not exactly the same as an edit distance cutoff, since it depends on per-base quality scores, but it seems like it might help you if your goal is finding things that are "very unique".
And any reason the gap open penalty (-O) is so low? I haven't thought about it too much, but there's a chance that could be a reason why you're getting some of the "garbage hits".
also, I get the exact same output when I set -O to 5 instead of 2
As I mentioned before, I'm concerned with indels as well as with mismatches. I'm pretty sure that bwa ignores indels in determining the mapping quality. If possible, I would like to know where those second-best hits are, which should be in the XA tag.