Hello.
I have a fastq file that contains targeted panel DNA from two species.
The current methodology is to map it to a combined reference genome.
I know that BWA-MEM gives a mapping quality of 0 to reads that map to multiple places, and that if a read maps to 2 locations, the MAPQ score is 3.
But I'm wondering whether this means the read maps EQUALLY well to all of those locations. I believe it does, and that in that case BWA-MEM picks one location at random to report. However, if the read maps better to one location than another, my understanding is that BWA-MEM reports only the best-scoring location. I have also read somewhere that BWA-MEM has an option to list all locations, and I just wanted to confirm that.
The reason I'm asking is that I want to be able to identify reads that map to one species rather than the other. Reads that come from a sequence that is very similar in both species will be ambiguous. From a first read of the paper, the newer tool "Disambiguate" (Ahdesmäki et al.) seems to align to each species separately (as opposed to mapping to a single combined reference), then collects all ambiguous reads that have equal mapping quality to regions in both species and removes them from the reads assigned to either species altogether. I fear that this might lead to a severe underestimation of coverage at a particular location. However, if we were to go with my method of simply filtering on mapping quality and location, we might risk under- or over-counting because of the randomly chosen locations for those equally mapped (ambiguous) reads. The chance that a gene sequence is exactly IDENTICAL between two different species is slim, given that the average insert size is longer than 50 bp, but I wanted to know what people would do in such situations. In short, is there really no way to know where an ambiguously mapped read comes from? How do people deal with this?
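To make the "mapping quality and location" idea concrete, here is a rough sketch of the filter I have in mind. It is only an illustration under several assumptions: the contig prefixes `hs_` and `mm_` are hypothetical names for the two species in the combined reference, the function names are mine, and it relies on the `XA:Z:` tag that BWA-MEM writes for alternative hits when there are only a few of them. Note that hits listed in `XA` are not guaranteed to score equally with the primary hit, so this errs on the side of calling a read ambiguous.

```python
from collections import Counter

# Hypothetical contig-name prefixes used when building the combined reference.
PREFIXES = {"hs_": "human", "mm_": "mouse"}


def species_of(contig):
    """Return the species label implied by a contig name, or 'unknown'."""
    for prefix, species in PREFIXES.items():
        if contig.startswith(prefix):
            return species
    return "unknown"


def classify_sam_line(line, min_mapq=1):
    """Classify one SAM alignment line as a species, 'ambiguous', etc."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:
        return "unmapped"
    if flag & (0x100 | 0x800):
        return "skipped"  # secondary or supplementary record
    primary = species_of(fields[2])
    if int(fields[4]) >= min_mapq:
        return primary  # confidently placed, so the species is clear
    # Low MAPQ: look at where the alternative hits (XA tag) fall.
    alt_species = None
    for tag in fields[11:]:
        if tag.startswith("XA:Z:"):
            # XA format: contig,strand+pos,CIGAR,NM; repeated per hit
            alt_species = {species_of(hit.split(",")[0])
                           for hit in tag[5:].split(";") if hit}
    if alt_species is None:
        return "ambiguous"  # no XA tag, so the other hits are unknown
    if alt_species == {primary}:
        return primary  # multi-mapped, but only within one species
    return "ambiguous"


def count_classes(sam_lines):
    """Tally classifications over SAM lines, ignoring '@' header lines."""
    return Counter(classify_sam_line(line)
                   for line in sam_lines if not line.startswith("@"))
```

With something like this, a read that is multi-mapped within a single species would still count toward that species, and only reads with candidate hits in both species (or with no listed alternatives at all) would be set aside as ambiguous.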
I'm working with mouse and human. I know many genes are similar, with about 80% having a 1-to-1 correspondence, but I'm not sure how much of that is in coding regions. I'll take a look! Thank you!
But again, I feel like there is no way to differentiate between reads that map equally well to both species (highly unlikely, but still possible).