Is It Possible To Get Fragment Length, Read Length And Number Of Fragments From A Bam/Sam File
6
10
Entering edit mode
11.8 years ago
samsara ▴ 630

Can i get following information from a BAM/SAM file ?

  • fragment length
  • read length
  • number of fragments

Or, do i need to ask the people who had sequenced the samples ?

bam next-gen • 42k views
ADD COMMENT
11
Entering edit mode
11.5 years ago
Fidel ★ 2.0k

Sure you can. The BAM/SAM format contains all that information.

To get the fragment length you need a paired end sequencing. Otherwise you will need to use some tools to process the bam file and estimate the fragment length (the peak caller MACS does that). The fragment length is given in the 9th column (see the Sam Format Specification).

The read length can be obtained either by looking at the CIGAR field (column 6) or by counting the length of the sequence (column 10). The CIGAR field encodes the differences in the read sequence with respect to the reference genome, but usually most reads map perfectly to the reference genome and the CIGAR field can quickly inform you about the read length. A CIGAR will look like this for a read of 100 bp: 100M. In other words if you just see a number followed by M and nothing else, that number is the read length.

The number of fragments is the number of lines in the BAM/SAM file for single end or half the number of lines for paired-end. Here one must be careful, thus is probably is easier to count the lines of the FASTQ file instead. The BAM/SAM format may have been filtered to remove all fragments that did not map, thus a count based on the BAM/SAM file will underestimate the total.

ADD COMMENT
5
Entering edit mode
10.9 years ago
danrdanny ▴ 70

To clarify, the fragment length in the SAM file can be a little misleading because of the way some genome browsers interpret it.

Field 9 (using 1-based, not 0-based counting) is the fragment length, not the insert size:

SEB9BZKS1:207:H8D40ADXX:2:2111:11030:73639    163    chr10    116212135    60    41M    =    116212198    164    TTAGAAAGGTTAAAACAATTAATGTATTTTTTTCAACAAAT    <>))=1>>7@<?9><)=38>???@>>?==<=86<?<<>7<?    NM:i:2    AS:i:31    XS:i:21
SEB9BZKS1:207:H8D40ADXX:2:2111:11030:73639    83    chr10    116212198    60    101M    =    116212135    -164    TAAATGCCAATTACACTGACACCAGGAAACACACATCTAGGGCCAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGAGAGGCTGAGGCAGGCGGAT    >5>>A:CCCB@>5:;;--,55@;6@A;7667?ACHFFHHGD=9;=C@8.@EFFCIGD@?HD;HHFD?>@F<?DB<2<+CGIIIIHGHFCDADDFDDDD?=1    NM:i:0    AS:i:101    XS:i:57

So, in the above example, the fragment size is 164 bp's. The insert size is much smaller than that. Some browsers will tell you the insert size is 164. It is not.

The insert size in this case is 164 - (101 + 41). That's the fragment length minus the sum of the two reads, or 22 bp's. (The 101 and 41 come from field 6.)

This gets a little odd when pairs overlap:

SEB9BZKS1:207:H8D40ADXX:1:2110:9214:15411    99    chr10    116211652    60    101M    =    116211721    170    AGAAAGAAGAAAAGTAGGGGAGGGGAGAGGGGAGAAAGAGAGGAGAAAAAATATTAATAATAATGTTGAAAAGGACAGTATGATGATGACATATGCTGACT    =?@DFDF?FCDFBGAEFHGICEFHIGIIIIIG;AFHIIIIEHGCEBDFEC>B@CBDCDECDECDC@>CCCCCCB<8?BC>@DC@CDDCDCCCCDC>@ACCC    NM:i:0    AS:i:101    XS:i:0
SEB9BZKS1:207:H8D40ADXX:1:2110:9214:15411    147    chr10    116211721    60    101M    =    116211652    -170    AAAGGACAGTATGATGATGACATATGCTGACTTTGCTAAGCACTCTATGCATATTTACTTTAACTCAGGAGGCAGTGCTTAAGAGCTCAAGCTCTGGAATG    CCEEECEBC;>>DHEHEC=7DIIJIIHD@4>IGHCHCGIJHFIIIIIJJIHFBGGHBIJIGC9HCIIGHHCJIJIIHEFFJIGIGGIIHHHHD4FFFDCCC    NM:i:1    AS:i:96    XS:i:20

Above are two reads 101 bp's in length with a fragment size of 170bp. So, no insert. But, browsers like IGV will gladly tell you that the insert size in this case is 170bp's.

Hope this helps–I was quite confused at first as well.

ADD COMMENT
2
Entering edit mode

Paired-end reads are a neat molecular biology trick. Remember that "insert" refers to the DNA fragment between the adaptors, and not the gap between R1 and R2. Instead we refer to that as the "inner mate distance".

source : Torsten Seemann

ADD REPLY
0
Entering edit mode

To underline what @Carlo Yague pointed out. This answer does not represent what most people refer to the term "insert size". See here: Insert Size And Fragment Size ?

This forum lacks a down-vote button (similar to stackexchange)

ADD REPLY
3
Entering edit mode
11.8 years ago
Sangwoo Kim ▴ 440

I am not sure if FastQC gives fragment length (not read length) statistics. I know picard has CollectInsertSizeMetrics program. http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics

ADD COMMENT
0
Entering edit mode

You are, of course, correct.

ADD REPLY
2
Entering edit mode
8.0 years ago
jerry ▴ 130

Quick note on fragment length as applied to RNA-Seq, since I've been working on it and haven't found an answer yet.

For paired-end reads from RNA-Seq, column 9 of the SAM file gives you the genomic distance from the beginning of the R1 read to the end of the R2 read. Note that this is NOT necessarily the fragment length, because you could have splicing between the paired-end reads. In other words, the paired-end (R1 and R2) reads could be very far apart from each other because there is a huge intron in between. Take a look at a typical RNA-Seq SAM file, and you will see that the values in column 9 can be quite large (several hundred kb). Again, this is most likely due to the presence of large introns in between.

Obtaining the original fragment length requires knowledge of splice events that might occur in between your paired-end reads, which is not a trivial problem. I have yet to find any tools or scripts to correctly obtain the fragment length that take splicing into account. If anyone knows of any, please share here. Thanks!

ADD COMMENT
1
Entering edit mode

It can't be done. Even without splicing events it can't be done from the 9th column (TLEN) alone, since TLEN is the first MAPPED position to the last MAPPED position. You also have to read the CIGAR string, check for indels, clipped bases, etc. And even then who knows what trimming was done before hand. And who knows what differences from the reference occurred in the unsequenced portion of the read.

The BAM doesn't care about fragments. It's all about alignments. If you want fragment lengths either sequence the whole fragment, or for the global distribution use a bioanalyzer.

ADD REPLY
1
Entering edit mode
11.8 years ago

Take a look at using FastQC on your BAM file.

ADD COMMENT
0
Entering edit mode

I did not get any information regarding fragment length using FastQC

ADD REPLY
1
Entering edit mode
11.5 years ago
venks ▴ 740

Hi,

You can get the read length from FASTQC which would give your statistics about all your reads/bases/base composition etc., from raw fastq files/BAM/SAM file. You can get the fragment size from qualimap tool or GATK's Depth of Coverage tool. I personally prefer Qualimap because of the ease of use and nice histograms that it creates.

Good luck

ADD COMMENT

Login before adding your answer.

Traffic: 1806 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6