Entering edit mode
8.1 years ago
epigene
▴
590
I was trying to understand how aligners, specifically bwa, determine which reads to apply the 1024 flag in the sam output.
I notice I have cases where two identical reads (same read seq and same quality scores) are mapped to different locations on yeast chr16 as shown below:
Mxxx:1:2103:26709:7555 0 chr16 560671 0 90M * 0 0 ATTAGTATGTAGAAATATAGATTCCATTTTGAGGATTCCTATATCCTCGAGGAGAACTTCTAGTATATTCTGTATACCTAATATTATAGC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG MD:Z:90 PG:Z:MarkDuplicates NM:i:0 AS:i:90 XS:i:90
Mxxx:1:2108:18700:2139 0 chr16 844608 0 90M * 0 0 ATTAGTATGTAGAAATATAGATTCCATTTTGAGGATTCCTATATCCTCGAGGAGAACTTCTAGTATATTCTGTATACCTAATATTATAGC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG MD:Z:90 PG:Z:MarkDuplicates NM:i:0 AS:i:90 XS:i:90
Why are they not considered duplicates? And how are duplicates determined?
What makes you think these two reads have the same read name?
One is
Mxxx:1:2103:26709:7555
, and one isMxxx:1:2108:18700:2139
.Duplicate calling is usually not done by the mapper like BWA, but rather done later by a duplicate-marking tool, in your case it looks like Picard's MarkDuplicates due to
PG:Z:MarkDuplicates
. This tool uses not the DNA sequence but the read's mapped position to decide if it's duplicate or not, and these reads despite having the same DNA have different mapped positions.This is likely because the reads have a MAPQ of 0, which for BWA specifically means that the read maps in multiple locations. In other words, this could very well be a PCR duplicate, however, since it also maps to multiple places on the genome and BWA randomly assigned those reads to different places, they weren't called duplicates by Picard.
I agree this is a really dumb problem to be having but this is how it goes.
I didn't say they have the same read name..and they don't. I meant for the seq and quality.
Yeah, after BWA, the bam file was feed to Picard's MarkDuplicates tool. Thanks for the explanations.
Right, sorry, i'm just trying to help clarify some terminology as often different uses of certain words leads to a lot of confusion down the line. For example, you didn't say "identical read names" you said "identical reads", however a "read" is not defined by the sequenced DNA (which was implied), otherwise PCR duplicates or legitimate biological duplicates would all have the same read name. It is defined by where on the flowcell the read came from, (or more generally, an individual "sequencing event" in the sequencer). Thus, when you said "two identical reads" I was expecting to see two entries from the same sequencing event -- which happens when a mapper like BWA also outputs secondary alignments.
Maybe you already knew all that, but when you spend a lot of time on a support forum trying to help people, it becomes a habit to spell everything out just to make sure everyone is on the same page. I really didn't intend to frustrate you or anything like that :)
I understand.. I should have been more careful with the right terms as there are important distinctions.