Question

Samtools: How To Find Tandem Hits Of Same Query Sequence In An Indexed Bam/Sam File?

4

Entering edit mode

13.5 years ago

Ahdf-Lell-Kocks ★ 1.6k

I would like to know how can I find tandem hits of the same query sequence in an indexed bam/sam file?

By this I mean hits of the same sequence in the same location, one hit next to each other, non-overlapping, given a maximum distance D between them, with no other hits from other sequences in the BAM file in between. I assume that given this would be queried in an indexed BAM files, these should be two consecutive entries. As a graphical example below, I am interested in finding the two consecutive hits of sequence 'aaaaaaaaaa' in the target genome:

----------------------------------------------------------------------------
    xxxxxxxxx  aaaaaaaaaa       aaaaaaaaaa       bbbbbbbbb    ccccccccc
  ddddddddd                                       eeeeeeeeeeeee

samtools bam • 3.4k views

ADD COMMENT • link updated 13.5 years ago by Fred ▴ 790 • written 13.5 years ago by Ahdf-Lell-Kocks ★ 1.6k

1

Entering edit mode

Are you really interested in finding repeated hits, or is there a biological question that you are trying to answer?

ADD REPLY • link 13.5 years ago by Sean Davis 27k

1

Entering edit mode

how about dummying up a sam file which has some such hits in it?

ADD REPLY • link 13.5 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

It's not clear to me. Can you give a graphical example please ?

ADD REPLY • link 13.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

It sounds like: find all non-overlapping intervals in the BAM which are not more distant than D from each other. You are correct that these should be adjacent in a sorted file, but you have to account for overlaps if these can exist.

ADD REPLY • link 13.5 years ago by Michael 56k

0

Entering edit mode

I still don't get "hits of the same sequence in the same location". What is the "sequence" ? what is the "hit" ?. Can your replace those words with "Reference Sequence" and "Short Read" please.

ADD REPLY • link 13.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Pierre, agreed more detail would be valuable. My "solution" below assumes "sequence" is that of the "Short Read" and "hit" is the "alignment" (being one row in the sam file). Hoping I'm right....

ADD REPLY • link 13.5 years ago by Malcolm.Cook ★ 1.5k

score 1 · Answer 1 · 2012-01-11

I'm not sure to understand exactly what you want to do, but the following script may help.

It takes a sorted input sam file and outputs the lines when the consecutive ones corresponds to identical sequences with different coordinates.

#!/usr/bin/perl -w
use strict;
my $prevChr   = -1;
my $prevStart = -1;
my $prevSequence = -1;
my $prevLine = -1;

while(<>){
    if(!/^@/){
        chomp;
        my @tab = split(/\t/);
        my ($chr,$start,$sequence) = @tab[2,3,9];
        if($prevStart ne -1 && $prevChr ne -1 && 
           $prevSequence eq $sequence && 
           $prevChr eq $chr && $start != $prevStart){
            print $prevLine."\n";
            print $_."\n";
        }

        $prevChr   = $chr;
        $prevStart = $start;
        $prevSequence = $sequence;
        $prevLine = $_;
    }
}

You just have to do something like that:

$ samtools view sorted_file.bam | script.pl