How Can I Know The Length Of Mapped Reads From Bam File?
6
10
Entering edit mode
11.8 years ago
Jordan ★ 1.3k

Hi,

I have a bam files produced from mapping by LifeScope Tool (mapping of SOLiD reads). I would like to know the length of the reads that were mapped to the reference genome based on the information present in bam files. Is there a tool which can give me the stats of such things from bam files?

Thanks!

bam mapping • 60k views
ADD COMMENT
0
Entering edit mode

Do you want the length after the read is aligned, including insertions and deletions?

ADD REPLY
0
Entering edit mode

I don't need to include insertions and deletions.

ADD REPLY
0
Entering edit mode

Hi Zev, what would the syntax be if I include indels and gaps in the reads?

ADD REPLY
23
Entering edit mode
11.8 years ago

This will take the first million reads and give you counts of each read length. To look at more of the bam, get rid of the "head" command:

samtools view file.bam | head -n 1000000 | cut -f 10 | perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
ADD COMMENT
1
Entering edit mode

and do not use only sort but sort -n

ADD REPLY
0
Entering edit mode

Ah! That was very simple to follow and clever use of unix commands.. Thanks for that!

ADD REPLY
0
Entering edit mode

This command outputs the length distribution of both mapped and unmapped reads right? I thought the question was to ask the length distribution output of only mapped reads.

ADD REPLY
2
Entering edit mode

Sure - make it samtools view -F 4 file.bam ... if you want only mapped

ADD REPLY
0
Entering edit mode

thanks a lot! so if I only want unmapped reads length distribution, I just need to make the "F" case lower "f" right? for example samtools view -f 4 file.bam. I tried "samtools view file.bam", "samtools view -F 4 file.bam" and "samtools view -f file.bam", I found that the reads count in a certain length in -f output and -F output did not add up to the same length counts in "samtools view file.bam", I am confused, I thought the mapped and unmapped read counts should added up to the total counts of a certain length in "samtools view file.bam" right?

ADD REPLY
0
Entering edit mode

Hi, thank you for your answer. Do you know how to modify your code if I want to specify the reads in a certain region, for example, want to know the reads length distribution that are only located in chr2?

ADD REPLY
18
Entering edit mode
11.8 years ago
Rm 8.3k

A little shorter version:

samtools view test.bam | awk '{print length($10)}' | head -1000 | sort -u
ADD COMMENT
1
Entering edit mode

What does $10 represent here?

ADD REPLY
1
Entering edit mode

The awk command is printing the string length of the tenth field (segment sequence) out of the SAM-formatted output from samtools view.

ADD REPLY
0
Entering edit mode

That was simple as well. Thanks!

ADD REPLY
0
Entering edit mode

Hi, how can I count the number of reads with certain length based on the above command? Thanks in advance!

ADD REPLY
0
Entering edit mode

I know this is an old comment, but in case someone is wondering the same: samtools view out/viral_bam2/0h_Apli_R1.out.bam | awk '{print length($10)}' | sort | uniq -c

ADD REPLY
3
Entering edit mode
11.8 years ago
Rm 8.3k

if you need to know alignment length of the read on the reference after it aligned: Using bioperl modules

##mapped_read_Length.pl 
#!/usr/bin/perl -w

use strict;
use warnings;  
use Bio::DB::Sam;

my $usage = "Usage $0 <input.bam>\n";
my $infile = shift or die $usage;
my $sam = Bio::DB::Sam->new(-bam  => $infile);
my @chromosomes = $sam->seq_ids;   
print "Read_ID\tChr\tStart\tEnd\tMapped_Length\tCIGAR_String\n";

foreach my $chr (@chromosomes){
   my @alignments = $sam->get_features_by_location(-seq_id => $chr);
   for my $a (@alignments) {
      my $id  = $a->name;
      my $cigar  = $a->cigar_str;
      my $start  = $a->start;
      my $end    = $a->end;
      my $mapped_length = $end-$start;
     print "$id\t$chr\t$start\t$end\t$mapped_length\t$cigar\n";    
}
}
__END__

OUTPUT:

Read_ID    Chr    Start    End    Mapped_Length    CIGAR_String
BMW-S105R:17:H02AWADXX:2:1109:4198:22439    chr1    564455    564503    48    6M1I43M
BMW-S105R:17:H02AWADXX:2:1109:4309:28023    chr1    564682    564732    50    5M1D45M
BMW-S105R:17:H02AWADXX:2:1109:15415:28127    chr1    564689    564737    48    37M1I12M
ADD COMMENT
1
Entering edit mode
11.8 years ago

There are many ways to skin a cat. Below is a c++ implementation of BamTools that will produce what your want. You can get much more information out the the al BamTools::Alignment object.

More about BamTools https://github.com/pezmaster31/bamtools

#include  "api/api_global.h"
#include  "api/BamReader.h"

using namespace std;
using namespace BamTools;


int main(int argc, const char * argv[])
{
   std::string filename = argv[1];
   BamTools::BamReader reader
   BamTools::BamAlignment al;
      while(reader.GetNextAlignment(al)){
            std::cout << al.Length << "\n";

        }
}
ADD COMMENT
0
Entering edit mode

I don't have experience with C++. Can it be implemented in a different language? I have used samtools.1 view to get the basic info. I think the output is in sam format. Probably I can write a code to parse the column 6 - CIGAR and output the length. But I was wondering if there is a tool already to do it. I didn't want to mistakes while computing length from CIGAR.

Looked into bamtools, but like I said I'm not good at C++.

ADD REPLY
0
Entering edit mode

I know that this comment is old, but since it is on the first page of google...

You forgot to open a file in your example.

ADD REPLY
1
Entering edit mode
11.8 years ago

BEDOPS includes a bam2bed conversion script, which filters unmapped reads in the course of converting to BED. You can therefore get mapped read lengths by piping the sequence field of the converted data to awk, e.g.:

$ bam2bed < foo.bam \
    | cut -f 12 \
    | awk '{print length($0)}' \
    > mapped_read_lengths.txt

If you use samtools directly with alternatives suggested in other answers, you might first want to prefilter your data to mapped reads with GNU awk, before piping to downstream utilities:

$ samtools view foo.bam | gawk '(! and(4, $2))' > mapped_foo.sam
$ samtools view -S mapped_foo.sam | ...

Or, more simply:

$ samtools view foo.bam | gawk '(! and(4, $2))' | ...

Where ... denotes the use of downstream utilities.

ADD COMMENT
0
Entering edit mode

The bam2bed will fail if you have spliced alignments. But he didn't specify if DNA or RNA alignment in the question so the answer is still valid.

ADD REPLY
1
Entering edit mode
7.6 years ago
doyourun ▴ 200

If you are given bam files and want to know the READ length that was used in sequencing, this will work. replace bamfile by your bam file (and make sure it was indexed .. e.g. a bamfile.bam.bai exists)

Typically, I need to know this to run a pipeline.

All on one line.

samtools view bamfile.bam | head -n 1000 | gawk '{print length($10)}' | sort | uniq -c | perl -ane '$_ =~ s/^[ ]+//g;print $_' | sort -k 1nr,1nr | head -1 | cut -f2 -d " "

ADD COMMENT

Login before adding your answer.

Traffic: 1906 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6