Question

How many ways a read pair can be mapped?

1

Entering edit mode

10.1 years ago

John 13k

Hello everyone!
I am trying to learn more about the SAM/BAM format and how reads are mapped to the genome.
I understand that reads can be mapped to the forward (sense) and reverse (antisense) strand.
I understand that in every read pair there is a first-in-pair and a second-in-pair read - and that this probably has something to do with the order in which they came off the sequencer and not the position in the genome? (but maybe im wrong)
I understand that weird things can happen. A pair of reads can map millions of base pairs apart. Both reads can map to the same strand. But can reads map in both upstream and downstream direction - i.e. the read sequence once reversed (but not reversed complimented) was mapped.

With three variables thats 32 different combinations for paired reads (and another 4 for singletons).
With two variables, there's only 8 different combinations (and 2 for singletons)
Regardless, does anyone know of any tool available which will give you the statistics for the 32+4 or 8+2 combinations in your BAM file? If not, i will add it to the new version of metaflagstat.

Here's a picture of what i'm trying to explain :)

sequencing mapping next-gen alignment • 4.8k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 10.1 years ago by John 13k

0

Entering edit mode

but most of the time, you don't really want to know if there is a diffrence in the count of `both reads are mapped + 1st read is forward, 2nd read is reverse' vs `both reads are mapped + 2st read is forward, 1st read is reverse' (furthermore, are they properly mapped ?) etc... people `use samtools view` to exclude/include anc count reads , or `samtools flagstats`

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I agree, it is somewhat less interesting than the core information flagstat gives you - but I suppose for a sanity-check it would be nice to know that those two possibilities occur roughly the same amount and their isn't first-in-pair = sense strand bias (which actually there is, but its not enough for people to care about..)

The main point of this post was to find out if there is an orientation variable. If an aligner will both try the reverse-compliment AND just-reverse on a read, and if so where the latter information is stored :)

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by John 13k

0

Entering edit mode

Sense and anti-sense have no general meaning when talking about the genome. That only makes sense in terms of RNAseq when mapped to the genome, since then while the alignments are in genomic coordinates, the strand can also dictate sense information in transcriptome-space.

In general, the possible mapping orientations will depend on the aligner and the settings you give it. For example, you can disable discordant and singleton alignments in many aligners.

ADD REPLY • link 10.1 years ago by Devon Ryan 105k

0

Entering edit mode

Thank you for your comment Devon (and as always you too Pierre!), but I fear I am not explaining myself very well.

Perhaps sense and anti-sense is a misnomer, but a read can be mapped in 4 possible ways:

As it comes off the sequencer - mapped to sense strand in 5' to 3' direction.
Reverse complimented - mapped to anti-sense strand in 5' to 3' direction
Just reversed - mapped to sense-strand in the 3' to 5' direction. Presumably no aligner does this, but perhaps they try?
Complimented but not reversed - mapped to anti-sense strand in the 3' to 5' direction. Again, I would presume aligners don't do this, but maybe after trying all other options they do.

I don't know if aligners do the latter two, and my question is simply, do they - and if so - where is that information stored?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by John 13k

1

Entering edit mode

9.4 years ago

Pierre Lindenbaum 164k

(8 months later ...) your answer reminds me a tool I wrote in april to get some statistics about the reads in a BAM: https://github.com/lindenb/jvarkit/wiki/BamStats02 The output is a text file but I wrote a GUI to explore the results:

visualizing 4 samples/in|off target/disjoint-chromosomes https://t.co/Hzwakfzre2 pic.twitter.com/3JEzgyGRb2
— Pierre Lindenbaum (@yokofakun) April 2, 2015

>500 samples https://t.co/Hzwakfzre2 pic.twitter.com/IthtSfKbdu
— Pierre Lindenbaum (@yokofakun) April 3, 2015

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Woah, that's hypnotic - I'm definitely going to check this out :D Thanks Pierre!

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.4 years ago by John 13k

Ram · Accepted Answer · 2014-11-25

5

Entering edit mode

10.1 years ago

Devon Ryan 105k

Using proper nomenclature, a given read can map to the + strand or reverse complemented to the + strand...which is the same as mapping to the - strand. It will never be mapped reverse to anything. Things a bit more complicated if you're talking about bisulfite converted reads, but aside from that those are the two possibilities. This is at least true with current technologies, but will likely remain so in the future due to the energetics of 3'->5' base addition.

With PE reads, any relative orientation of the mates to each other is possible, though some are vastly more likely than the others.

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 10.1 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks Devon! That's great news :) Certainly makes things a lot easier, and explains why I couldn't find a field for orientation :)

This leaves either 10 or 12 possible combinations for mapping a read pair, depending on whether a singleton can be first/second in a pair (i guess it can, which gives 12 combinations in total, rather than 10).

I wish I could mark it as the answer, but as you only commented all I can do is give you an upvote :P

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 10.1 years ago by John 13k

0

Entering edit mode

As requested, it's now an answer :)

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 10.1 years ago by Devon Ryan 105k

Ram · Accepted Answer · 2015-07-24

I was planning on posting this along with the two programs that uses these read types (a QC program which counts these read types, and a pileup tool (like bedtools, etc) where you can filter on read type - but since they're a long way from being published I'll just post the list for now :)

EDIT: I should add that the formula on the right are also not 100% true. I fell into the trap of assuming the read end is the read start plus the length of the sequence. Its not - you have to parse the CIGAR string. So wherever you see len(SEQ), that really means len(of_all_the_mapped_and_skipped_bases). But that didn't fit. Also the variable PSEQ doesn't exist, which means some values cannot be calculated without knowing the mate's read data too.