Sure you can. The BAM/SAM format contains all that information.
To get the fragment length you need a paired end sequencing. Otherwise you will need to use some tools to process the bam file and estimate the fragment length (the peak caller MACS does that). The fragment length is given in the 9th column (see the Sam Format Specification).
The read length can be obtained either by looking at the CIGAR field (column 6) or by counting the length of the sequence (column 10). The CIGAR field encodes the differences in the read sequence with respect to the reference genome, but usually most reads map perfectly to the reference genome and the CIGAR field can quickly inform you about the read length. A CIGAR will look like this for a read of 100 bp: 100M. In other words if you just see a number followed by M and nothing else, that number is the read length.
The number of fragments is the number of lines in the BAM/SAM file for single end or half the number of lines for paired-end. Here one must be careful, thus is probably is easier to count the lines of the FASTQ file instead. The BAM/SAM format may have been filtered to remove all fragments that did not map, thus a count based on the BAM/SAM file will underestimate the total.
To clarify, the fragment length in the SAM file can be a little misleading because of the way some genome browsers interpret it.
Field 9 (using 1-based, not 0-based counting) is the fragment length, not the insert size:
SEB9BZKS1:207:H8D40ADXX:2:2111:11030:73639 163 chr10 116212135 60 41M = 116212198 164 TTAGAAAGGTTAAAACAATTAATGTATTTTTTTCAACAAAT <>))=1>>7@<?9><)=38>???@>>?==<=86<?<<>7<? NM:i:2 AS:i:31 XS:i:21
SEB9BZKS1:207:H8D40ADXX:2:2111:11030:73639 83 chr10 116212198 60 101M = 116212135 -164 TAAATGCCAATTACACTGACACCAGGAAACACACATCTAGGGCCAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGAGAGGCTGAGGCAGGCGGAT >5>>A:CCCB@>5:;;--,55@;6@A;7667?ACHFFHHGD=9;=C@8.@EFFCIGD@?HD;HHFD?>@F<?DB<2<+CGIIIIHGHFCDADDFDDDD?=1 NM:i:0 AS:i:101 XS:i:57
So, in the above example, the fragment size is 164 bp's. The insert size is much smaller than that. Some browsers will tell you the insert size is 164. It is not.
The insert size in this case is 164 - (101 + 41). That's the fragment length minus the sum of the two reads, or 22 bp's. (The 101 and 41 come from field 6.)
This gets a little odd when pairs overlap:
SEB9BZKS1:207:H8D40ADXX:1:2110:9214:15411 99 chr10 116211652 60 101M = 116211721 170 AGAAAGAAGAAAAGTAGGGGAGGGGAGAGGGGAGAAAGAGAGGAGAAAAAATATTAATAATAATGTTGAAAAGGACAGTATGATGATGACATATGCTGACT =?@DFDF?FCDFBGAEFHGICEFHIGIIIIIG;AFHIIIIEHGCEBDFEC>B@CBDCDECDECDC@>CCCCCCB<8?BC>@DC@CDDCDCCCCDC>@ACCC NM:i:0 AS:i:101 XS:i:0
SEB9BZKS1:207:H8D40ADXX:1:2110:9214:15411 147 chr10 116211721 60 101M = 116211652 -170 AAAGGACAGTATGATGATGACATATGCTGACTTTGCTAAGCACTCTATGCATATTTACTTTAACTCAGGAGGCAGTGCTTAAGAGCTCAAGCTCTGGAATG CCEEECEBC;>>DHEHEC=7DIIJIIHD@4>IGHCHCGIJHFIIIIIJJIHFBGGHBIJIGC9HCIIGHHCJIJIIHEFFJIGIGGIIHHHHD4FFFDCCC NM:i:1 AS:i:96 XS:i:20
Above are two reads 101 bp's in length with a fragment size of 170bp. So, no insert. But, browsers like IGV will gladly tell you that the insert size in this case is 170bp's.
Hope this helps–I was quite confused at first as well.
I am not sure if FastQC gives fragment length (not read length) statistics. I know picard has CollectInsertSizeMetrics program. http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics
Quick note on fragment length as applied to RNA-Seq, since I've been working on it and haven't found an answer yet.
For paired-end reads from RNA-Seq, column 9 of the SAM file gives you the genomic distance from the beginning of the R1 read to the end of the R2 read. Note that this is NOT necessarily the fragment length, because you could have splicing between the paired-end reads. In other words, the paired-end (R1 and R2) reads could be very far apart from each other because there is a huge intron in between. Take a look at a typical RNA-Seq SAM file, and you will see that the values in column 9 can be quite large (several hundred kb). Again, this is most likely due to the presence of large introns in between.
Obtaining the original fragment length requires knowledge of splice events that might occur in between your paired-end reads, which is not a trivial problem. I have yet to find any tools or scripts to correctly obtain the fragment length that take splicing into account. If anyone knows of any, please share here. Thanks!
It can't be done. Even without splicing events it can't be done from the 9th column (TLEN) alone, since TLEN is the first MAPPED position to the last MAPPED position. You also have to read the CIGAR string, check for indels, clipped bases, etc. And even then who knows what trimming was done before hand. And who knows what differences from the reference occurred in the unsequenced portion of the read.
The BAM doesn't care about fragments. It's all about alignments. If you want fragment lengths either sequence the whole fragment, or for the global distribution use a bioanalyzer.
Take a look at using FastQC on your BAM file.
Hi,
You can get the read length from FASTQC which would give your statistics about all your reads/bases/base composition etc., from raw fastq files/BAM/SAM file. You can get the fragment size from qualimap tool or GATK's Depth of Coverage tool. I personally prefer Qualimap because of the ease of use and nice histograms that it creates.
Good luck
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
source : Torsten Seemann
To underline what @Carlo Yague pointed out. This answer does not represent what most people refer to the term "insert size". See here: Insert Size And Fragment Size ?
This forum lacks a down-vote button (similar to stackexchange)