All on one line.

Question

How Can I Know The Length Of Mapped Reads From Bam File?

10

Entering edit mode

11.7 years ago

Jordan ★ 1.3k

Hi,

I have a bam files produced from mapping by LifeScope Tool (mapping of SOLiD reads). I would like to know the length of the reads that were mapped to the reference genome based on the information present in bam files. Is there a tool which can give me the stats of such things from bam files?

Thanks!

bam mapping • 60k views

ADD COMMENT • link updated 12 months ago by Lluís R. ★ 1.2k • written 11.7 years ago by Jordan ★ 1.3k

0

Entering edit mode

Do you want the length after the read is aligned, including insertions and deletions?

ADD REPLY • link 11.7 years ago by Zev.Kronenberg 12k

0

Entering edit mode

I don't need to include insertions and deletions.

ADD REPLY • link 11.7 years ago by Jordan ★ 1.3k

0

Entering edit mode

Hi Zev, what would the syntax be if I include indels and gaps in the reads?

ADD REPLY • link 3.7 years ago by lovegenetics ▴ 70

3

Entering edit mode

11.7 years ago

Rm 8.3k

if you need to know alignment length of the read on the reference after it aligned: Using bioperl modules

##mapped_read_Length.pl 
#!/usr/bin/perl -w

use strict;
use warnings;  
use Bio::DB::Sam;

my $usage = "Usage $0 <input.bam>\n";
my $infile = shift or die $usage;
my $sam = Bio::DB::Sam->new(-bam  => $infile);
my @chromosomes = $sam->seq_ids;   
print "Read_ID\tChr\tStart\tEnd\tMapped_Length\tCIGAR_String\n";

foreach my $chr (@chromosomes){
   my @alignments = $sam->get_features_by_location(-seq_id => $chr);
   for my $a (@alignments) {
      my $id  = $a->name;
      my $cigar  = $a->cigar_str;
      my $start  = $a->start;
      my $end    = $a->end;
      my $mapped_length = $end-$start;
     print "$id\t$chr\t$start\t$end\t$mapped_length\t$cigar\n";    
}
}
__END__

OUTPUT:

Read_ID    Chr    Start    End    Mapped_Length    CIGAR_String
BMW-S105R:17:H02AWADXX:2:1109:4198:22439    chr1    564455    564503    48    6M1I43M
BMW-S105R:17:H02AWADXX:2:1109:4309:28023    chr1    564682    564732    50    5M1D45M
BMW-S105R:17:H02AWADXX:2:1109:15415:28127    chr1    564689    564737    48    37M1I12M

ADD COMMENT • link 11.7 years ago by Rm 8.3k

1

Entering edit mode

11.7 years ago

Zev.Kronenberg 12k

There are many ways to skin a cat. Below is a c++ implementation of BamTools that will produce what your want. You can get much more information out the the al BamTools::Alignment object.

More about BamTools https://github.com/pezmaster31/bamtools

#include  "api/api_global.h"
#include  "api/BamReader.h"

using namespace std;
using namespace BamTools;


int main(int argc, const char * argv[])
{
   std::string filename = argv[1];
   BamTools::BamReader reader
   BamTools::BamAlignment al;
      while(reader.GetNextAlignment(al)){
            std::cout << al.Length << "\n";

        }
}

ADD COMMENT • link 11.7 years ago by Zev.Kronenberg 12k

0

Entering edit mode

I don't have experience with C++. Can it be implemented in a different language? I have used samtools.1 view to get the basic info. I think the output is in sam format. Probably I can write a code to parse the column 6 - CIGAR and output the length. But I was wondering if there is a tool already to do it. I didn't want to mistakes while computing length from CIGAR.

Looked into bamtools, but like I said I'm not good at C++.

ADD REPLY • link 11.7 years ago by Jordan ★ 1.3k

0

Entering edit mode

I know that this comment is old, but since it is on the first page of google...

You forgot to open a file in your example.

ADD REPLY • link 7.0 years ago by Tg ▴ 320

1

Entering edit mode

11.7 years ago

Alex Reynolds 36k

BEDOPS includes a bam2bed conversion script, which filters unmapped reads in the course of converting to BED. You can therefore get mapped read lengths by piping the sequence field of the converted data to awk, e.g.:

$ bam2bed < foo.bam \
    | cut -f 12 \
    | awk '{print length($0)}' \
    > mapped_read_lengths.txt

If you use samtools directly with alternatives suggested in other answers, you might first want to prefilter your data to mapped reads with GNU awk, before piping to downstream utilities:

$ samtools view foo.bam | gawk '(! and(4, $2))' > mapped_foo.sam
$ samtools view -S mapped_foo.sam | ...

Or, more simply:

$ samtools view foo.bam | gawk '(! and(4, $2))' | ...

Where ... denotes the use of downstream utilities.

ADD COMMENT • link 11.7 years ago by Alex Reynolds 36k

0

Entering edit mode

The bam2bed will fail if you have spliced alignments. But he didn't specify if DNA or RNA alignment in the question so the answer is still valid.

ADD REPLY • link 5.2 years ago by opplatek ▴ 300

1

Entering edit mode

7.6 years ago

doyourun ▴ 200

If you are given bam files and want to know the READ length that was used in sequencing, this will work. replace bamfile by your bam file (and make sure it was indexed .. e.g. a bamfile.bam.bai exists)

Typically, I need to know this to run a pipeline.

All on one line.

samtools view bamfile.bam | head -n 1000 | gawk '{print length($10)}' | sort | uniq -c | perl -ane '$_ =~ s/^[ ]+//g;print $_' | sort -k 1nr,1nr | head -1 | cut -f2 -d " "

ADD COMMENT • link 7.6 years ago by doyourun ▴ 200

score 23 · Accepted Answer · 2013-02-27

23

Entering edit mode

11.7 years ago

Chris Miller 22k

This will take the first million reads and give you counts of each read length. To look at more of the bam, get rid of the "head" command:

samtools view file.bam | head -n 1000000 | cut -f 10 | perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c

ADD COMMENT • link 11.7 years ago by Chris Miller 22k

1

Entering edit mode

and do not use only sort but sort -n

ADD REPLY • link 8.2 years ago by Korsocius ▴ 260

0

Entering edit mode

Ah! That was very simple to follow and clever use of unix commands.. Thanks for that!

ADD REPLY • link 11.7 years ago by Jordan ★ 1.3k

0

Entering edit mode

This command outputs the length distribution of both mapped and unmapped reads right? I thought the question was to ask the length distribution output of only mapped reads.

ADD REPLY • link 5.4 years ago by xiaoleiusc ▴ 140

2

Entering edit mode

Sure - make it samtools view -F 4 file.bam ... if you want only mapped

ADD REPLY • link 5.4 years ago by Chris Miller 22k

0

Entering edit mode

thanks a lot! so if I only want unmapped reads length distribution, I just need to make the "F" case lower "f" right? for example samtools view -f 4 file.bam. I tried "samtools view file.bam", "samtools view -F 4 file.bam" and "samtools view -f file.bam", I found that the reads count in a certain length in -f output and -F output did not add up to the same length counts in "samtools view file.bam", I am confused, I thought the mapped and unmapped read counts should added up to the total counts of a certain length in "samtools view file.bam" right?

ADD REPLY • link 5.4 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

Hi, thank you for your answer. Do you know how to modify your code if I want to specify the reads in a certain region, for example, want to know the reads length distribution that are only located in chr2?

ADD REPLY • link 3.1 years ago by Yingzi • 0

zx8754 · Accepted Answer · 2013-02-27

18

Entering edit mode

11.7 years ago

Rm 8.3k

A little shorter version:

samtools view test.bam | awk '{print length($10)}' | head -1000 | sort -u

ADD COMMENT • link updated 4.5 years ago by zx8754 12k • written 11.7 years ago by Rm 8.3k

1

Entering edit mode

What does $10 represent here?

ADD REPLY • link 2.4 years ago by MobiusT ▴ 20

1

Entering edit mode

The awk command is printing the string length of the tenth field (segment sequence) out of the SAM-formatted output from samtools view.

ADD REPLY • link 2.4 years ago by Alex Reynolds 36k

0

Entering edit mode

That was simple as well. Thanks!

ADD REPLY • link 11.7 years ago by Jordan ★ 1.3k

0

Entering edit mode

Hi, how can I count the number of reads with certain length based on the above command? Thanks in advance!

ADD REPLY • link 8.1 years ago by SpreeFU • 0

0

Entering edit mode

I know this is an old comment, but in case someone is wondering the same: samtools view out/viral_bam2/0h_Apli_R1.out.bam | awk '{print length($10)}' | sort | uniq -c

ADD REPLY • link 12 months ago by Lluís R. ★ 1.2k