Question

Is It Possible To Get Fragment Length, Read Length And Number Of Fragments From A Bam/Sam File

10

Entering edit mode

11.7 years ago

samsara ▴ 630

Can i get following information from a BAM/SAM file ?

fragment length
read length
number of fragments

Or, do i need to ask the people who had sequenced the samples ?

bam next-gen • 42k views

ADD COMMENT • link updated 7.9 years ago by jerry ▴ 130 • written 11.7 years ago by samsara ▴ 630

score 11 · Answer 1 · 2013-06-17

Sure you can. The BAM/SAM format contains all that information.

To get the fragment length you need a paired end sequencing. Otherwise you will need to use some tools to process the bam file and estimate the fragment length (the peak caller MACS does that). The fragment length is given in the 9th column (see the Sam Format Specification).

The read length can be obtained either by looking at the CIGAR field (column 6) or by counting the length of the sequence (column 10). The CIGAR field encodes the differences in the read sequence with respect to the reference genome, but usually most reads map perfectly to the reference genome and the CIGAR field can quickly inform you about the read length. A CIGAR will look like this for a read of 100 bp: 100M. In other words if you just see a number followed by M and nothing else, that number is the read length.

The number of fragments is the number of lines in the BAM/SAM file for single end or half the number of lines for paired-end. Here one must be careful, thus is probably is easier to count the lines of the FASTQ file instead. The BAM/SAM format may have been filtered to remove all fragments that did not map, thus a count based on the BAM/SAM file will underestimate the total.

score 5 · Answer 2 · 2014-01-24

To clarify, the fragment length in the SAM file can be a little misleading because of the way some genome browsers interpret it.

Field 9 (using 1-based, not 0-based counting) is the fragment length, not the insert size:

SEB9BZKS1:207:H8D40ADXX:2:2111:11030:73639    163    chr10    116212135    60    41M    =    116212198    164    TTAGAAAGGTTAAAACAATTAATGTATTTTTTTCAACAAAT    <>))=1>>7@<?9><)=38>???@>>?==<=86<?<<>7<?    NM:i:2    AS:i:31    XS:i:21
SEB9BZKS1:207:H8D40ADXX:2:2111:11030:73639    83    chr10    116212198    60    101M    =    116212135    -164    TAAATGCCAATTACACTGACACCAGGAAACACACATCTAGGGCCAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGAGAGGCTGAGGCAGGCGGAT    >5>>A:CCCB@>5:;;--,55@;6@A;7667?ACHFFHHGD=9;=C@8.@EFFCIGD@?HD;HHFD?>@F<?DB<2<+CGIIIIHGHFCDADDFDDDD?=1    NM:i:0    AS:i:101    XS:i:57

So, in the above example, the fragment size is 164 bp's. The insert size is much smaller than that. Some browsers will tell you the insert size is 164. It is not.

The insert size in this case is 164 - (101 + 41). That's the fragment length minus the sum of the two reads, or 22 bp's. (The 101 and 41 come from field 6.)

This gets a little odd when pairs overlap:

SEB9BZKS1:207:H8D40ADXX:1:2110:9214:15411    99    chr10    116211652    60    101M    =    116211721    170    AGAAAGAAGAAAAGTAGGGGAGGGGAGAGGGGAGAAAGAGAGGAGAAAAAATATTAATAATAATGTTGAAAAGGACAGTATGATGATGACATATGCTGACT    =?@DFDF?FCDFBGAEFHGICEFHIGIIIIIG;AFHIIIIEHGCEBDFEC>B@CBDCDECDECDC@>CCCCCCB<8?BC>@DC@CDDCDCCCCDC>@ACCC    NM:i:0    AS:i:101    XS:i:0
SEB9BZKS1:207:H8D40ADXX:1:2110:9214:15411    147    chr10    116211721    60    101M    =    116211652    -170    AAAGGACAGTATGATGATGACATATGCTGACTTTGCTAAGCACTCTATGCATATTTACTTTAACTCAGGAGGCAGTGCTTAAGAGCTCAAGCTCTGGAATG    CCEEECEBC;>>DHEHEC=7DIIJIIHD@4>IGHCHCGIJHFIIIIIJJIHFBGGHBIJIGC9HCIIGHHCJIJIIHEFFJIGIGGIIHHHHD4FFFDCCC    NM:i:1    AS:i:96    XS:i:20

Above are two reads 101 bp's in length with a fragment size of 170bp. So, no insert. But, browsers like IGV will gladly tell you that the insert size in this case is 170bp's.

Hope this helps–I was quite confused at first as well.

score 3 · Answer 3 · 2013-02-28

3

Entering edit mode

11.7 years ago

Sangwoo Kim ▴ 440

I am not sure if FastQC gives fragment length (not read length) statistics. I know picard has CollectInsertSizeMetrics program. http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics

ADD COMMENT • link 11.7 years ago by Sangwoo Kim ▴ 440

0

Entering edit mode

You are, of course, correct.

ADD REPLY • link 11.7 years ago by Sean Davis 27k

score 2 · Answer 4 · 2016-12-22

Quick note on fragment length as applied to RNA-Seq, since I've been working on it and haven't found an answer yet.

For paired-end reads from RNA-Seq, column 9 of the SAM file gives you the genomic distance from the beginning of the R1 read to the end of the R2 read. Note that this is NOT necessarily the fragment length, because you could have splicing between the paired-end reads. In other words, the paired-end (R1 and R2) reads could be very far apart from each other because there is a huge intron in between. Take a look at a typical RNA-Seq SAM file, and you will see that the values in column 9 can be quite large (several hundred kb). Again, this is most likely due to the presence of large introns in between.

Obtaining the original fragment length requires knowledge of splice events that might occur in between your paired-end reads, which is not a trivial problem. I have yet to find any tools or scripts to correctly obtain the fragment length that take splicing into account. If anyone knows of any, please share here. Thanks!

score 1 · Answer 5 · 2013-02-28

1

Entering edit mode

11.7 years ago

Sean Davis 27k

Take a look at using FastQC on your BAM file.

ADD COMMENT • link 11.7 years ago by Sean Davis 27k

0

Entering edit mode

I did not get any information regarding fragment length using FastQC

ADD REPLY • link 11.7 years ago by samsara ▴ 630

score 1 · Answer 6 · 2013-06-17

Hi,

You can get the read length from FASTQC which would give your statistics about all your reads/bases/base composition etc., from raw fastq files/BAM/SAM file. You can get the fragment size from qualimap tool or GATK's Depth of Coverage tool. I personally prefer Qualimap because of the ease of use and nice histograms that it creates.

Good luck