I am doing some tests to learn more about SAM flags.
In an attempt to split an original BAM file into two forward and reverse BAM files following this tutorial I bumped into a question. I am using paired end data.
This is an example of the alignments I have in the reverse strand BAM file:
A00681:387:HCVJTDRXY:1:2245:2419:34898 99 chr3 85295832 255 139M = 85295832 139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTAGTAAACCT FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFF:F:::FFFFFF:F,FFF,F:FFFFFF:,FFFF:F,FF NH:i:1 HI:i:1 AS:i:274 nM:i:1 NM:i:0 MD:Z:139 jM:B:c,-1 jI:B:i,-1 rB:B:i,1,139,85295832,85295970 MC:Z:139M
A00681:387:HCVJTDRXY:1:2245:2419:34898 147 chr3 85295832 255 139M = 85295832 -139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTACTAAACCT FFF,FFFFFFFFF:FFFFF::,,FFF,FF,F,F,F:FFFFFFFFFFFFFFFFFFFF,FF:FFFF:F:,FFFF,:FFF:FFFFFF:FFFF,:FFFFFFF,FFFFF:FFFFFFFFFF,,FFFFFFFFFFFFFF,FFF,FFF NH:i:1 HI:i:1 AS:i:274 nM:i:1 NM:i:1 MD:Z:131G7 jM:B:c,-1 jI:B:i,-1 rB:B:i,141,279,85295832,85295970 MC:Z:139M
They have the same QNAME. Different flags:
147:
read paired (0x1)
read mapped in proper pair (0x2)
read reverse strand (0x10)
second in pair (0x80)
99:
read paired (0x1)
read mapped in proper pair (0x2)
mate reverse strand (0x20)
first in pair (0x40)
Maybe I have not understand correctly the meaning of QNAME, but are these the same read?? If so why do they have different flags and 1 nucleotide of a difference (in bold)??
From the SAM format manual: "Reads/segments having identical QNAME are regarded to come from the same template." What exactly they mean by template?
In my quest to understand this, I ran featureCounts two times one with -s 2 (it's a reversely stranded library) and -s 0. -s 0 gives me double counts. I am not sure how to interpret this... Any help?
Thank you!!
Hi, thank you for the answer. As you can see, I have trouble with the very basics here, please bare with me :-). Sorry if i was not clear enough with what I meant with "reverse strand bam file" - From the tutorial I linked "Here is our script that splits a bam file into two fwd.bam and rev.bam each containing the alignments that are in the transcriptional direction." This is exactly what I am after, a BAM file containing the reads on a reverse transcriptional direction. It kinda hit me right now that this has nothing to do with R1/2. Both mates should come from the same direction (right?). My question is then, does the correct way of quantifying features from this file would be with -s 2, right? Otherwise I would be quantifying each mate and that's why I get double counts with -s 0. And the mismatch between the two sequences could be explained by a lack of precision towards the end of the reads (right?). Again, thank you. It's easier to understand by simply bouncing the idea.
See: Insert Size And Fragment Size ? Top answer should clarify this. They come from the
same
fragment and sample the two strands in 5'-->3' direction.Yes, so it's the same fragment (which was transcribed in either + or - direction), read from two directions, one of the mates reading the complementary. A super oversimplification of this - let's forget for a second about PCR duplicates - if we would have a single molecule of a transcript, the correct quantification would be 1, as featureCounts would understand from the flags in the BAM file that this is 1 pair of reads, correct? If so, then how does -s play a role here? why do I see a difference between using -s 0 and -s 2?