How can I find tandem hits of the same query sequence in an indexed BAM/SAM file?
By this I mean hits of the same sequence at the same location: one hit right next to the other, non-overlapping, separated by at most a maximum distance D, with no hits from other sequences in the BAM file in between. I assume that, since this would be queried in an indexed (coordinate-sorted) BAM file, these should be two consecutive entries. In the graphical example below, I am interested in finding the two consecutive hits of sequence 'aaaaaaaaaa' in the target genome:
----------------------------------------------------------------------------
xxxxxxxxx aaaaaaaaaa aaaaaaaaaa bbbbbbbbb ccccccccc
ddddddddd eeeeeeeeeeeee
Are you really interested in finding repeated hits, or is there a biological question that you are trying to answer?
How about mocking up a SAM file that has some such hits in it?
It's not clear to me. Can you give a graphical example, please?
It sounds like: find all non-overlapping intervals in the BAM which are not more distant than D from each other. You are correct that these should be adjacent in a sorted file, but you have to account for overlaps if these can exist.
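A minimal sketch of that idea in Python, under several assumptions that are mine rather than the original poster's: the input is coordinate-sorted SAM text (for example the output of `samtools view`), "same sequence" means an identical SEQ column, and any record sorted between two hits counts as an intervening hit, even if it overlaps one of them. The names `Hit`, `parse_sam` and `tandem_hits` are made up for illustration, and the SAM records are dummy data.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple


class Hit(NamedTuple):
    name: str
    ref: str
    start: int  # 1-based leftmost reference position (SAM POS)
    end: int    # 1-based inclusive end on the reference
    seq: str


_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def ref_length(cigar: str) -> int:
    """Reference bases consumed by a CIGAR string (M, D, N, = and X ops)."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "MDN=X")


def parse_sam(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield mapped alignments from SAM text lines, skipping headers."""
    for line in lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        if int(f[1]) & 4:  # FLAG bit 0x4: read is unmapped
            continue
        start = int(f[3])
        yield Hit(f[0], f[2], start, start + ref_length(f[5]) - 1, f[9])


def tandem_hits(hits: Iterable[Hit], max_dist: int) -> Iterator[Tuple[Hit, Hit]]:
    """Yield consecutive, non-overlapping hits with identical SEQ on the
    same reference, separated by at most max_dist bases."""
    prev = None
    for hit in hits:
        if prev is not None and hit.ref == prev.ref and hit.seq == prev.seq:
            gap = hit.start - prev.end - 1  # negative means overlap
            if 0 <= gap <= max_dist:
                yield prev, hit
        prev = hit


# Dummy records: r2 and r3 carry the same sequence, one base apart.
sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r2\t0\tchr1\t12\t60\t10M\t*\t0\t0\tAAAAAAAAAA\t*",
    "r3\t0\tchr1\t23\t60\t10M\t*\t0\t0\tAAAAAAAAAA\t*",
    "r4\t0\tchr1\t34\t60\t10M\t*\t0\t0\tCCCCCCCCCC\t*",
]
pairs = list(tandem_hits(parse_sam(sam_lines), max_dist=5))
print([(a.name, b.name) for a, b in pairs])  # [('r2', 'r3')]
```

Two caveats: SEQ is stored relative to the forward strand of the reference, so a read aligned once forward and once reverse will not compare equal here, and secondary alignments often have `*` in SEQ, so they would need to be matched by read name instead.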
I still don't get "hits of the same sequence in the same location". What is the "sequence"? What is the "hit"? Can you replace those words with "Reference Sequence" and "Short Read", please?
Pierre, agreed, more detail would be valuable. My "solution" below assumes that "sequence" is that of the "Short Read" and that a "hit" is an "alignment" (i.e. one row in the SAM file). Hoping I'm right...