I am having difficulty navigating all of the considerations when thinking about strand-specific sequencing, and how to determine which actual strand a read comes from.
Here is the setup: I have paired-end Illumina sequence generated with a strand-specific library. It was put through Tophat using the first-strand option, and now I have a SAM file. While browsing the SAM file, I see my read pairs and am interpreting their flag stat, but my mind is going in circles trying to determine if the pair mapped to the actual (+) or (-) strand, which I need to know for distinguishing counts in cases of overlapping genes. I realize that reads spanning junctions have the XS field, but I want to find strandedness from just the flag stat.
Consider 3 pairs of reads that are properly paired in mapping:
READ_ID_A 99 Chr1 155388 50 76M = 155531 219 TTCAGCTTCTTTGAATCTCTTGACGTTGTGTAGAAGCCATTTGTATGATTCATCTTTTCGGTCTTGACACGGATCG
READ_ID_A 147 Chr1 155531 50 76M = 155388 -219 CACACGACACCGTTTCGTCTAGCTTCGGCAAGTGAAGCAGAAACGTGAGGACGTTGGCATTTGATGCATAGAAAAT
READ_ID_B 99 Chr1 180537 50 76M = 180672 211 TGCGCTTGTGGTTGATCTTTCTTCTCTCCTTCCTTCTTATCGCCACCTTCTTTCTTCTCTTCTTCCTTCTTCGGTG
READ_ID_B 147 Chr1 180672 50 76M = 180537 -211 CCACCACCTTCCTTCTTCGGCTCCTCCTTCTTCTCCTTTTCCGGCTCTTTCGCAGGTCCCACTAGTACGATATCCG
READ_ID_C 163 Chr1 285419 50 75M = 285474 130 GGCTACTATGGCCCTGTCTCTGGCTGAAGCTAGGCTTCAGGTTATAGTGGAATCACTTGAAGCTGGTGCAGGGAA
READ_ID_C 83 Chr1 285474 50 75M = 285419 -130 CTTGAAGCTGGTGCAGGGAATGATATTCCACATGTGTCTGAAGAGACTGAGGAAACAATAGATGTTAATGATAAA
Looking at just the first two reads in READ_ID_A
, we can decipher the flag stat to find out that the first read is R1, and is from the (-) strand. Likewise, the second read is R2, and is from the (+) strand.
For READ_ID_B
, the first read is R1 and (-), and the second is R2 and (+).
However, for READ_ID_C
, the first read is R2 (-) and the second is R1 (+).
So, from this information, how can I tell which of these pairs of reads should be assigned to the (+) or (-) strand of the reference genome? I was thinking perhaps I should just check the strandedness of R1, and use that. So, if R1 turns out to be (+), then I will assign the R1/R2 pair to the (+) of the reference. Is this wrong?
EDIT: Perhaps I interpreted the flag stat incorrectly? Does anyone reach the same conclusion as me?