Samtools: How To Find Tandem Hits Of Same Query Sequence In An Indexed Bam/Sam File?
1
4
Entering edit mode
12.9 years ago
Ahdf-Lell-Kocks ★ 1.6k

I would like to know how can I find tandem hits of the same query sequence in an indexed bam/sam file?

By this I mean hits of the same sequence in the same location, one hit next to each other, non-overlapping, given a maximum distance D between them, with no other hits from other sequences in the BAM file in between. I assume that given this would be queried in an indexed BAM files, these should be two consecutive entries. As a graphical example below, I am interested in finding the two consecutive hits of sequence 'aaaaaaaaaa' in the target genome:

----------------------------------------------------------------------------
    xxxxxxxxx  aaaaaaaaaa       aaaaaaaaaa       bbbbbbbbb    ccccccccc
  ddddddddd                                       eeeeeeeeeeeee
samtools bam • 3.1k views
ADD COMMENT
1
Entering edit mode

Are you really interested in finding repeated hits, or is there a biological question that you are trying to answer?

ADD REPLY
1
Entering edit mode

how about dummying up a sam file which has some such hits in it?

ADD REPLY
0
Entering edit mode

It's not clear to me. Can you give a graphical example please ?

ADD REPLY
0
Entering edit mode

It sounds like: find all non-overlapping intervals in the BAM which are not more distant than D from each other. You are correct that these should be adjacent in a sorted file, but you have to account for overlaps if these can exist.

ADD REPLY
0
Entering edit mode

I still don't get "hits of the same sequence in the same location". What is the "sequence" ? what is the "hit" ?. Can your replace those words with "Reference Sequence" and "Short Read" please.

ADD REPLY
0
Entering edit mode

Pierre, agreed more detail would be valuable. My "solution" below assumes "sequence" is that of the "Short Read" and "hit" is the "alignment" (being one row in the sam file). Hoping I'm right....

ADD REPLY
1
Entering edit mode
12.9 years ago
Fred ▴ 790

I'm not sure to understand exactly what you want to do, but the following script may help.

It takes a sorted input sam file and outputs the lines when the consecutive ones corresponds to identical sequences with different coordinates.

#!/usr/bin/perl -w
use strict;
my $prevChr   = -1;
my $prevStart = -1;
my $prevSequence = -1;
my $prevLine = -1;

while(<>){
    if(!/^@/){
        chomp;
        my @tab = split(/\t/);
        my ($chr,$start,$sequence) = @tab[2,3,9];
        if($prevStart ne -1 && $prevChr ne -1 && 
           $prevSequence eq $sequence && 
           $prevChr eq $chr && $start != $prevStart){
            print $prevLine."\n";
            print $_."\n";
        }

        $prevChr   = $chr;
        $prevStart = $start;
        $prevSequence = $sequence;
        $prevLine = $_;
    }
}

You just have to do something like that:

$ samtools view sorted_file.bam | script.pl
ADD COMMENT

Login before adding your answer.

Traffic: 2573 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6