Question

Mapping With Second Best Hits

0

Entering edit mode

11.2 years ago

Walter • 0

I'm having some trouble with repetitively mapping Illumina data. I'd like to remove any reads that have a reasonably close second-best hit in the genome/transcriptome, so if the best hit has an edit distance of n, I'd like to remove it if the second-best hit is within n+x (say n < 5, x < 2).

For historical/collaboration reasons, we've been using bwa, which I know is designed to find second-best hits but not necessarily all hits up to a given edit distance. I believe that the bwa aln -N option is supposed to find all (more or less) hits up to a given edit distance (without the -N option, it appears to only find second-best hits with up to a given number of mismatches, not including indels). I'm getting garbage for the hits in the XA tag, eg:

readID1    16    chr5    67488388    20    101M    *    0    0    ACAGAGGAACCTTCTTAGTGCTCACCGACCTTATACCCAGCTATGGGGACCAGCTCCAGAAAGCACCTTTGCTCCCAATTGGTTAGCTATAGTAAGGTGCG    DDDDDDDC?DCDDDDDDDDDDBDDDDFHHHHHHG@JJJJJJIIHJJJJIJJJJJIJJJJJJJIHGJJJJJIIGJJJJJJJJJIJJJJIHHHHHFFFFF@<C    XT:A:U    NM:i:1    X0:i:1    X1:i:2    XM:i:1    XO:i:0    XG:i:0    MD:Z:43G57    XA:Z:chr5,-67488388,101M,6;chr5,-67488390,2S99M,6;

Where you'll note that the start coordinates are identical or very similar for all three hits. For reference, this is how I'm calling bwa:

bwa aln -b -O2 -n 6 -t 12 -N ref.fa input.bam | bwa samse ref.fa - input.bam | samtools view -h -b -S -o remapped_N.bam -

I've also tried bowtie2 (very slow) and masai (very memory intensive) for mapping. A few questions:

Am I doing something wrong in how I'm calling bwa?
Does anyone have a better mapper for my particular situation? (I'm actually mapping transcriptome data against a genome including an artificial transcriptome as the rna-seq specific mappers don't seem to do well with finding possible second-best hits, but I'm open to suggestions.)

PS I've looked at this post: For short reads, which aligners find all hits given certain edit distance threshold? but I don't think any of the output was checked for accuracy nor speed.

bwa mapping • 3.6k views

ADD COMMENT • link updated 11.2 years ago by Istvan Albert 102k • written 11.2 years ago by Walter • 0

0

Entering edit mode

Have you tried mapping quality cutoffs to achieve what you want? It's not exactly the same as an edit distance cutoff, since it depends on per-base quality scores, but it seems like it might help you if your goal is finding things that are "very unique".

And any reason the gap open penalty (-O) is so low? I haven't thought about it too much, but there's a chance that could be a reason why you're getting some of the "garbage hits".

ADD REPLY • link 11.2 years ago by matted 7.8k

0

Entering edit mode

also, I get the exact same output when I set -O to 5 instead of 2

ADD REPLY • link 11.2 years ago by Walter • 0

0

Entering edit mode

As I mentioned before, I'm concerned with indels as well as with mismatches. I'm pretty sure that bwa ignores indels in determining the mapping quality. If possible, I would like to know where those second-best hits are, which should be in the XA tag.

ADD REPLY • link 11.2 years ago by Walter • 0

score 1 · Answer 1 · 2013-10-16

Just a thought... I've done studies that sought artificial gene duplication in large genome assemblies by using Bowtie2 and parsing the output .SAM file to find reads that mapped with high scores to more than one location. If you have paired-end reads, this is especially revealing, since the significance of the finding two mapping locations is increased when the R1 and R2 both map to each location as a proper pair.

When running your Bowtie2 command, use the -k <int> optional parameter. This tells Bowtie2 to report up to int best alignments per read pair (or single-end read). So, if you are looking for reads that indicate duplicated regions in the genome, I would map using Bowtie2 and -k 3. Then, parsing the .SAM file would tell me how many reads mapped uniquely (1 position), how many reads mapped to duplicates (2 positions), and how many reads were low complexity/poor scoring/too common for use (3 position, which include 3 or more places).

Find more in the doc for Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml