Hello everyone!
I am trying to learn more about the SAM/BAM format and how reads are mapped to the genome.
I understand that reads can be mapped to the forward (sense) and reverse (antisense) strand.
I understand that in every read pair there is a first-in-pair and a second-in-pair read - and that this probably has something to do with the order in which they came off the sequencer and not the position in the genome? (but maybe I'm wrong)
I understand that weird things can happen. A pair of reads can map millions of base pairs apart. Both reads can map to the same strand. But can reads also map in both the upstream and downstream direction, i.e. can the read sequence be mapped once reversed (but not reverse-complemented)?
With three variables (strand, first/second-in-pair, and this reversed-or-not orientation), that's 32 different combinations for paired reads (and another 4 for singletons).
With two variables, there are only 8 different combinations (and 2 for singletons).
Regardless, does anyone know of a tool that will report statistics for the 32+4 or 8+2 combinations in a BAM file? If not, I will add it to the new version of metaflagstat.
Here's a picture of what I'm trying to explain :)
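To make the 8+2 tally concrete, here is a rough Python sketch that works purely from the FLAG, RNAME, POS, RNEXT and PNEXT columns of SAM text (for example, the output of `samtools view`). The bit values are the ones defined in the SAM specification. The function names and category labels are made up for illustration, and "singleton" is assumed here to mean a mapped read that is unpaired or whose mate is unmapped. Each pair is counted once, through its first-in-pair record.

```python
from collections import Counter

# FLAG bits as defined in the SAM specification.
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
REVERSE, MATE_REVERSE = 0x10, 0x20
FIRST_IN_PAIR = 0x40
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def classify(flag, rname, pos, rnext, pnext):
    """Return a category label for one record, or None if it is not counted."""
    # Skip unmapped reads and non-primary alignments.
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return None
    strand = "-" if flag & REVERSE else "+"
    # Assumption: a singleton is an unpaired read or one with an unmapped mate.
    if not flag & PAIRED or flag & MATE_UNMAPPED:
        return f"singleton strand={strand}"
    # Count each pair once, using only its first-in-pair record.
    if not flag & FIRST_IN_PAIR:
        return None
    # Positions are only comparable when both mates are on the same reference.
    if rnext not in ("=", rname):
        return "pair on different references"
    mate_strand = "-" if flag & MATE_REVERSE else "+"
    leftmost = "first" if pos <= pnext else "second"
    return f"pair first={strand} second={mate_strand} leftmost={leftmost}"


def tally_sam(lines):
    """Count categories over an iterable of SAM text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        label = classify(int(f[1]), f[2], int(f[3]), f[6], int(f[7]))
        if label is not None:
            counts[label] += 1
    return counts
```

Passing `sys.stdin` to `tally_sam` while piping in `samtools view` output would give the two strands of each mate times which mate is leftmost (8 pair categories), plus 2 singleton categories. Note that nothing in the FLAG field would let this sketch distinguish a "reversed but not complemented" orientation.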
But most of the time, you don't really want to know whether there is a difference in the count of 'both reads are mapped + 1st read is forward, 2nd read is reverse' vs 'both reads are mapped + 2nd read is forward, 1st read is reverse' (furthermore, are they properly mapped?), etc. People use `samtools view` to include/exclude and count reads, or `samtools flagstat`.
I agree, it is somewhat less interesting than the core information flagstat gives you - but I suppose for a sanity check it would be nice to know that those two possibilities occur roughly equally often and that there isn't a first-in-pair = sense strand bias (which there actually is, but it's not enough for people to care about..)
The main point of this post was to find out if there is an orientation variable: whether an aligner will try both the reverse-complement AND the just-reversed version of a read, and if so, where the latter information is stored :)
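To show the distinction on an actual sequence, here is a small Python sketch of the operations in question. The helper names are hypothetical. The one SAM fact it relies on is that the specification describes FLAG bit 0x10 as "SEQ being reverse complemented". As far as I know there is no FLAG bit for a read that was reversed without being complemented, so treat the last function as a sketch of how 0x10 is normally interpreted, not as a statement about what any particular aligner tries.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

REVERSE = 0x10  # SAM FLAG bit: SEQ is stored reverse complemented


def reverse(seq):
    """Just-reversed: same bases, opposite order."""
    return seq[::-1]


def complement(seq):
    """Complemented but not reversed."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq):
    """The sequence as read from the opposite strand."""
    return complement(seq)[::-1]


def as_sequenced(seq, flag):
    """Recover the read as it came off the sequencer from a SAM record's SEQ."""
    return reverse_complement(seq) if flag & REVERSE else seq


print(reverse("AACG"))             # GCAA
print(reverse_complement("AACG"))  # CGTT
print(as_sequenced("CGTT", 16))    # AACG
```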
Sense and anti-sense have no general meaning when talking about the genome. They only make sense for RNA-seq reads mapped to the genome: the alignments are in genomic coordinates, but the strand can also convey sense information in transcriptome space.
In general, the possible mapping orientations will depend on the aligner and the settings you give it. For example, you can disable discordant and singleton alignments in many aligners.
Thank you for your comment Devon (and as always you too Pierre!), but I fear I am not explaining myself very well.
Perhaps "sense" and "anti-sense" are misnomers, but a read can be mapped in 4 possible ways:
I don't know if aligners do the latter two, and my question is simply: do they, and if so, where is that information stored?