Does anybody know how it is possible for a read to be flagged as both "first in pair" and "second in pair" when the flag also means "read mapped in proper pair"? (For example, see the Picard tool's explanation of flags 195, 211, 227 or 243.)
There is no 'first in pair' or 'second in pair', only 'first segment in template' and 'last segment in template'. Although no aligner I've ever used sets these flags for single-end reads, I see no reason why it couldn't, particularly if some sort of merging tool was used that combines overlapping pairs into a single read. EDIT: There's also this, straight from the SAM/BAM spec:
• If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or the index is lost in data processing
Also, you can't trust 'proper pair' (which also doesn't exist; it's "properly aligned"), secondary alignment or supplementary alignment if unmapped is also set, so you should check that too.
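To make the bit arithmetic concrete, here is a minimal Python sketch that decodes the four flags from the question and applies the two caveats above. The bit descriptions follow the FLAG table in the SAM spec; the function names (`decode_flag`, `segment_position`, `reliable_bits`) and the position labels are my own, hypothetical choices, not part of any library.

```python
# FLAG bits and descriptions as listed in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "template having multiple segments in sequencing",
    0x2: "each segment properly aligned according to the aligner",
    0x4: "segment unmapped",
    0x8: "next segment in the template unmapped",
    0x10: "SEQ being reverse complemented",
    0x20: "SEQ of the next segment in the template being reverse complemented",
    0x40: "the first segment in the template",
    0x80: "the last segment in the template",
    0x100: "secondary alignment",
    0x200: "not passing filters",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

# Bits named above as not trustworthy when 0x4 (unmapped) is set.
UNRELIABLE_IF_UNMAPPED = 0x2 | 0x100 | 0x800


def decode_flag(flag):
    """Return the descriptions of every bit set in `flag`."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def segment_position(flag):
    """Classify the read's position in its template (labels are hypothetical)."""
    first = bool(flag & 0x40)
    last = bool(flag & 0x80)
    if first and last:
        return "middle"  # linear template, neither first nor last
    if first:
        return "first"
    if last:
        return "last"
    return "unknown"  # index unknown or lost in processing


def reliable_bits(flag):
    """Mask out the bits that cannot be trusted when the read is unmapped."""
    if flag & 0x4:
        return flag & ~UNRELIABLE_IF_UNMAPPED
    return flag


if __name__ == "__main__":
    for flag in (195, 211, 227, 243):
        print(flag, hex(flag), segment_position(flag))
        for description in decode_flag(flag):
            print("   ", description)
```

All four flags share 0x1, 0x2, 0x40 and 0x80 and differ only in the two strand bits (0x10 and 0x20), so by the letter of the spec they describe a properly aligned segment in the middle of a multi-segment template. Whether that is what the tool that wrote them intended is a separate question.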
The BAM format was written to be future-proof, at a time when the future was unclear. It was perfectly plausible that all sorts of sequencing technologies could be invented where large fragments get sequenced in many spots, so the spec tries to stay away from 'paired' terminology as much as possible. However, it seems that multiple-reads-per-fragment sequencing is not likely to ever happen, and we are more likely to go down the path of a few really, really long reads. For this reason, there is a split between what the strict definitions set out in the spec say - which is also what the aligners/tools are most likely to follow - and the practical application of the specification by bioinformaticians.
It upsets me that a read can be "on chromosome 1" but also "unmapped", because that makes no logical sense. However, it's part of the SAM spec, and if you're not aware of it, it will come back to bite you. I think, if you're going to work with SAM/BAM files, you really need to be aware of this difference between your intuition/expectations and what the spec actually says; otherwise you'll have errors that you won't catch, because no tool will tell you that you're doing something wrong on a spec-compliant BAM file. Well... no tool except Picard. Picard hates everything. :P
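As a rough illustration of the "on chromosome 1 but unmapped" case, the sketch below checks a single tab-separated SAM record for the unmapped bit (0x4) alongside a real RNAME and POS. The spec recommends giving an unmapped read the RNAME and POS of its mapped mate, which is how such records arise. The function name and the example records are invented for illustration and do not come from any real file.

```python
def is_placed_but_unmapped(sam_line):
    """True if the record is flagged unmapped (0x4) yet carries a reference
    name and position. `sam_line` is one tab-separated SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: FLAG
    rname = fields[2]       # column 3: RNAME ("*" when absent)
    pos = int(fields[3])    # column 4: POS (0 when absent)
    return bool(flag & 0x4) and rname != "*" and pos > 0


# Invented example records (not real data).
# Flag 69 = 0x1 + 0x4 + 0x40: paired, unmapped, first segment.
placed = "read1\t69\tchr1\t1000\t0\t*\t=\t1000\t0\tACGT\tIIII"
# Flag 99 = 0x1 + 0x2 + 0x20 + 0x40: an ordinary mapped first segment.
mapped = "read2\t99\tchr1\t1000\t60\t4M\t=\t1200\t204\tACGT\tIIII"

if __name__ == "__main__":
    print(is_placed_but_unmapped(placed))
    print(is_placed_but_unmapped(mapped))
```

Filtering on RNAME alone would keep the first record, which is the kind of silent error described above; checking 0x4 first avoids it.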
These flags, READ1 and READ2, may mean different things depending on the library preparation and methodology.
In Illumina sequencing they refer to the order in which the fragment is sequenced, which means a separation in both space and time. The instrument first produces reads that get placed in the first file; these will be marked READ1 after alignment. Then, some time later, once the READ1 data is complete, the fragments attached to the flowcell get complemented, "bend over" to neighboring spots, and are sequenced again as if it were a new run. This data will go into a second file and will be labeled READ2 in the SAM file.
Hence having a read marked as both READ1 and READ2 at the same time is incorrect under the "normal" definition. There is probably a story behind why these flags are set this way: someone had to run some tool that would only work if ... fill in the blanks ... and the solution was to set the flags to incorrect values.
Can you share more information about the data? A few lines of the BAM file as examples would help too. You can share the flag, chromosome, position, and CIGAR value to help get an idea of the alignment.
Could you provide a few example lines from the SAM file? This sounds very odd.
I think it should not be possible for a read to have such flags. Where did you get the SAM file from?
Thanks for the reply. It looks odd, but I got the original BAM file from the TCGA database.
I'm trying to understand whether there's a logical explanation for this before I assume it's a data integrity issue.
Is there any update on this post, please? I have also seen reads with flag 195 in a TCGA dataset, and I'm a bit confused.