Hello!
I am on my way to understanding more deeply SAM flags, and in one of my tests I bumped into a question that I'm having a hard time finding an answer to it.
I had a BAM file with all my alignments. I wanted to split it by the strand the alignments were mapping to, so I followed the script in this tutorial , I'm using paired end data.
Now I have a reverse and a forward BAM file. I'm interested on a locus (chr3:85289931-85295940) which has a forward stranded L1PA2 in the human genome. Now, if I use featureCounts with -s 0 and quantify both BAM files (forward and reverse) in this feature, I get let's say 1,000 and 10 respectively.
I don't understand why I get 10 reads from my reverse BAM file as the L1PA2 is forward stranded. There is nothing else in this region (I also checked other regions just in case this was a special case, but no - this is not a unique scenario).
Here are some lines from the reverse strand SAM file version of my BAM (some of the reads escaping the filtering of the strands?):
A00681:387:HCVJTDRXY:1:2245:2419:34898 99 chr3 85295832 255 139M = 85295832 139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTAGTAAACCT FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFF:F:::FFFFFF:F,FFF,F:FFFFFF:,FFFF:F,FF NH:i:1 HI:i:1 AS:i:274 nM:i:1 NM:i:0 MD:Z:139 jM:B:c,-1 jI:B:i,-1 rB:B:i,1,139,85295832,85295970 MC:Z:139M A00681:387:HCVJTDRXY:1:2245:2709:35055 99 chr3 85295832 255 139M = 85295832 139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTAGTAAACCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFF:FFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:276 nM:i:0 NM:i:0 MD:Z:139 jM:B:c,-1 jI:B:i,-1 rB:B:i,1,139,85295832,85295970 MC:Z:139M A00681:387:HCVJTDRXY:1:2245:2419:34898 147 chr3 85295832 255 139M = 85295832 -139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTACTAAACCT FFF,FFFFFFFFF:FFFFF::,,FFF,FF,F,F,F:FFFFFFFFFFFFFFFFFFFF,FF:FFFF:F:,FFFF,:FFF:FFFFFF:FFFF,:FFFFFFF,FFFFF:FFFFFFFFFF,,FFFFFFFFFFFFFF,FFF,FFF NH:i:1 HI:i:1 AS:i:274 nM:i:1 NM:i:1 MD:Z:131G7 jM:B:c,-1 jI:B:i,-1 rB:B:i,141,279,85295832,85295970 MC:Z:139M A00681:387:HCVJTDRXY:1:2245:2709:35055 147 chr3 85295832 255 139M = 85295832 -139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTAGTAAACCT FFFFFFFFFFF:F,FFFFFFFFFFFF:FF:FF:FFFFFF:FFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFF,FF,FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:276 nM:i:0 NM:i:0 MD:Z:139 jM:B:c,-1 jI:B:i,-1 rB:B:i,141,279,85295832,85295970 MC:Z:139M
Any idea what could be going on?
I checked and my flags for the reverse strand BAM (99 and 147) and I think it makes sense..
99:
- read paired (0x1)
- read mapped in proper pair (0x2)
- mate reverse strand (0x20)
- first in pair (0x40)
147:
- read paired (0x1)
- read mapped in proper pair (0x2)
- read reverse strand (0x10)
- second in pair (0x80)
I am not sure I follow the question, you should be getting different counts, after all you separated the data by strand. What is unexpected about getting different counts?
you don't need to separate the BAM files as Featurecounts can handle the proper stranded counting in the first place.
In this case you should be getting the same numbers using -s 1 and -s 2
Hi! Thank you so much for your answer! I really appreciate the tutorial you did about filtering alignments by strand :-) Yes, I understand I am expecting different counts. I just realised my question had an important typo (-s 0 --> -s 2), it's now corrected, sorry. The question is why don't I get zero counts from the reverse transcription BAM file if the feature I am quantifying is on the forward strand? (when forcing strandness with -s 2) I would assume if I am isolating the alignments from each transcriptional direction that none of the reads from the reverse BAM file should be quantified in a forward stranded feature..
Thank you! I know featurecounts can handle stranded counting - The reason I am doing it this way is that I have stranded and unstranded data of the same samples, and I want to quantify how much "noise" (more like, antisense transcription relative to the features I am interested in) I would be looking at if I would only have the unstranded data.
If I count with featureCounts (-s 2) I would get the quantification in the correct strand the features are in, but in my particular case I want to get numbers of the amount of reads in antisense to the features as well. That's why I filter first, although I guess I could do -s 1 and then -s 2 and should give me the same information, right? :-)