Hi, everyone,
I want to get uniquely mapped reads, so need to select the NH (number of hits) tag value that is equal to 1. However, why some lines of BAM file do not contain NH tag?
I appreciate any of your comments. Thank you very much!
Hi, everyone,
I want to get uniquely mapped reads, so need to select the NH (number of hits) tag value that is equal to 1. However, why some lines of BAM file do not contain NH tag?
I appreciate any of your comments. Thank you very much!
NH is an optional tag, not all aligners will include it (and those that do will often omit it for reads with only a single hit). Anyway, filter by MAPQ instead.
Hi, we recently discussed the uniquely mapped reads question: C: Bowtie 2 - is there a way to discard reads mapping to multiple locations?
The XS:i
flag is sometimes used, which gives the quality of the next best matching position. So if this is present it is not a uniquely mapped read. But like Devon says it is probably best to get off the uniquely mapped reads train.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hi Devon, thanks a lot for your answer. I have one further question: MAPQ is a parameter to indicate the quality of mapping reads. Is there publication that use MAPQ filtering method? To my knowledge, unique mapping strategy has been well documented, for instance, http://www.sciencedirect.com/science/article/pii/S1534580715000362. This is the reason I want to get uniquely mapped reads rather than MAPQ filtering. I appreciate any of your comments. THANK YOU!
"Uniquely mapped" is far from well documented, it doesn't even have a single definition.
Hi, Devon, Thanks for your comment. I think it's useful, although I am a bit confused. I need to learn more. THANKS.
Maybe worth keeping in mind that you can align any read anywhere in the genome, you just need to allow enough mismatches. So there is no such thing as uniquely mapped, strictly speaking. However, you can ask whether the best hit has alignment score sufficiently higher then the second best to say that the read is "uniquely mapped". Effectively this is what you would do if you filter by mapq, as Devon suggests.
Thanks, dariober. My understanding is: if mapping quality is low, "uniquely mapped" reads becomes useless. So, setting a MAPQ threshold guarantee high-quality reads. Using "multiple mapped" or "uniquely mapped" is not a good criterion to assess gene expressions. Above is my understanding. I hope to catch your idea. THANKS anyway.
An important nuance is that MAPQ is heavily affected not just by the correctness of a read's sequence but also by how well the sample you sequenced happens to match the reference you're aligning against. For example, if you sequenced a human you should expect a fair number of lowish MAPQ but none-the-less absolutely correct alignments due to SNPs.