Question

Estimate Insert Size In Paired-End/Mate-Pair

18

13.6 years ago

Ric ▴ 190

Hello,

I have Illumina paired-end/mate-pair reads and I would like to find out what the min and max insert sizes are. For an aligner like SOAP2, is it necessary to specify the insert size?

What is the best way to calculate the min and max insert size?

Thank you in advance.

next-gen sequencing alignment paired • 53k views

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 13.6 years ago by Ric ▴ 190

4

Usually it does not matter too much. Giving the max a large value (e.g. 1000bp for typical Illumina reads) is fine. Or use BWA, which estimates the insert size on the fly.

ADD REPLY • link 13.6 years ago by lh3 33k

Ram · Answer 1 · 2012-05-17

18

13.3 years ago

csmiller ▴ 180

From a SAM/BAM file, you can use Picard:

http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics

You can also get mean and standard deviation quickly from a SAM/BAM file using these two awk scripts, replacing YOUR_MEAN in the second line with the output of the first line:

head -10000 mappings.sam | awk '{if ($9 > 0) {S+=$9; T+=1}}END{print "Mean: " S/T}'
head -10000 mappings.sam | awk -v M=YOUR_MEAN '{if ($9 > 0) {S+=($9-M)*($9-M); T+=1}}END{print "StdDev: " sqrt(S/T)}'

ADD COMMENT • link 13.3 years ago by csmiller ▴ 180

1

Or calculate both stats simultaneously:

awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'

Note that this (and your examples) counts all pairs twice; you should probably check the flags to count only the "left" ends or something similar.

ADD REPLY • link 13.0 years ago by Ketil 4.2k

0

I know I'm late in the game, but how would you check for left ends only, using awk?

ADD REPLY • link 12.3 years ago by Lee Katz ★ 3.2k

0

Continuing the late-comment theme... For bowtie alignments, the TLEN field ($9 in awk) is negative for the reverse-strand end, so I don't think this will actually count twice with bowtie. It will, however, double count read pairs that align to more than one place. You're right that you could do this better by bitmasking the FLAG field for first in pair and primary alignment, but not all versions of awk have this capability.

ADD REPLY • link 11.9 years ago by csmiller ▴ 180
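
One way to do the FLAG check in plain awk, without bitwise operators, is integer division and modulo. The sketch below is only an illustration and not something posted in this thread: `first_mate_insert_stats` is a made-up function name, it expects SAM text on stdin, and it assumes you want records that are first in pair (0x40) and neither secondary (0x100) nor supplementary (0x800). It takes the absolute value of TLEN, so the strand of the first mate does not matter.

```shell
# Hypothetical helper: reads SAM text on stdin and prints n, mean and stdev
# of |TLEN| using one record per pair (first in pair, primary alignment only).
first_mate_insert_stats() {
  awk -F'\t' '
    /^@/ { next }                            # skip SAM header lines
    {
      flag = $2
      first     = int(flag / 64)   % 2       # bit 0x40:  first read in pair
      secondary = int(flag / 256)  % 2       # bit 0x100: secondary alignment
      suppl     = int(flag / 2048) % 2       # bit 0x800: supplementary alignment
      if (!first || secondary || suppl) next
      tlen = $9
      if (tlen < 0) tlen = -tlen             # TLEN is signed; keep the magnitude
      if (tlen == 0) next                    # 0 means no template length available
      n++; s += tlen; s2 += tlen * tlen
    }
    END {
      if (n < 2) { print "n=" (n + 0); exit }
      m = s / n
      print "n=" n ", mean=" m ", stdev=" sqrt((s2 - m * m * n) / (n - 1))
    }'
}
```

Usage would be something like `first_mate_insert_stats < mappings.sam`. Adding a check on bit 0x2 (properly paired) is a possible further restriction, depending on how your aligner sets it.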

0

Can you please explain how this formula works:

sqrt ((S2-M*M*N)/(N-1))

ADD REPLY • link 9.0 years ago by Medhat 9.8k
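
For what it's worth, the expression looks like the usual shortcut form of the sample standard deviation. Using the names from the awk one-liner (N values, S2 the sum of squares, M = S/N the mean) and writing x_i for the individual insert sizes, a short derivation would be:

```latex
\[
\begin{aligned}
s^2 &= \frac{1}{N-1}\sum_{i=1}^{N}(x_i - M)^2 \\
    &= \frac{1}{N-1}\left(\sum_{i=1}^{N}x_i^2 - 2M\sum_{i=1}^{N}x_i + N M^2\right) \\
    &= \frac{1}{N-1}\left(S_2 - 2M \cdot N M + N M^2\right)
     = \frac{S_2 - N M^2}{N-1},
\qquad \text{stdev} = \sqrt{s^2}.
\end{aligned}
\]
```

The middle step uses the fact that the sum of the x_i equals N·M. This form lets everything be accumulated in a single pass, at the cost of possible loss of precision when the values are large compared to their spread.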

0

It gives strange results for the mean, and I don't know why. For my SAM file, the first 10000 lines give 217 (Picard gives 198). Running the script on all reads gives 1032.79. Overall, the value seems to increase with the number of reads.

Edit:

On another, smaller file, Picard gives mean = 156, while Qualimap and the awk script above give mean = 1262.04.

Even more strangely, the insert size histogram ends at 475 and the median is 165.

Is there some outlier rejection in Picard? It seems to me that mean and SD should be quite objective values.

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 6.0 years ago by boczniak767 ▴ 880
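
A discrepancy like this is what you would expect if one tool trims outliers and the other does not. As far as I know, Picard's CollectInsertSizeMetrics has a DEVIATIONS option (default 10) and computes the mean and SD only on values up to median + DEVIATIONS × median absolute deviation (MAD), while a plain awk mean includes every pair, including chimeric ones with huge TLEN. The sketch below mimics that kind of trimming so the two numbers can be compared; it is not Picard's code, `trimmed_insert_stats` and `median_of_sorted` are made-up names, and it assumes one positive insert size per line on stdin.

```shell
# Hypothetical helper: median of an already sorted, one-column numeric stream.
median_of_sorted() {
  awk '{ v[NR] = $1 }
       END {
         if (NR == 0) exit
         if (NR % 2) m = v[(NR + 1) / 2]
         else        m = (v[NR / 2] + v[NR / 2 + 1]) / 2
         print m
       }'
}

# Hypothetical helper: mean/stdev of insert sizes after dropping values
# above median + k * MAD (k defaults to 10).
trimmed_insert_stats() {
  k="${1:-10}"
  tmp=$(mktemp) || return 1
  sort -n > "$tmp"                                   # sorted copy of stdin
  med=$(median_of_sorted < "$tmp")
  # MAD = median of the absolute deviations from the median
  mad=$(awk -v m="$med" '{ d = $1 - m; if (d < 0) d = -d; print d }' "$tmp" |
        sort -n | median_of_sorted)
  awk -v med="$med" -v mad="$mad" -v k="$k" '
    $1 <= med + k * mad { n++; s += $1; s2 += $1 * $1 }   # keep non-outliers
    END {
      if (n < 2) { print "median=" med ", mad=" mad ", n=" (n + 0); exit }
      m = s / n
      print "median=" med ", mad=" mad ", n=" n ", mean=" m ", stdev=" sqrt((s2 - m * m * n) / (n - 1))
    }' "$tmp"
  rm -f "$tmp"
}
```

For example, `awk '$9 > 0 { print $9 }' mappings.sam | trimmed_insert_stats 10` would feed it the same values the one-liners above use. If the trimmed mean lands near the Picard value and the untrimmed one does not, a handful of very large inserts is the likely cause.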

Ram · Answer 2 · 2012-01-23

I had the same question and solved it by aligning my reads in several runs with some plausible values for this parameter, then calculating the insert size distribution afterwards with the script from this thread.

I had to estimate the parameter for a TopHat run, so I tried several values for the insert size and found that this parameter had almost no impact, at least on the dataset I used.

Hope that helps!

Ram · Answer 3 · 2012-01-23

4

13.6 years ago

Leszek 4.2k

Usually (if there is no reference genome), I get some crappy assembly with or without PE info (SOAPdenovo/Velvet) and then I align 100k paired reads (BWA or Bowtie). From that you will get quite accurate mean and SD of insert size.

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 13.6 years ago by Leszek 4.2k

Ram · Answer 4 · 2012-01-23

3

13.6 years ago

Wen.Huang ★ 1.2k

The only rough way to know those numbers pre-alignment would be to look at the gel image (or the Bioanalyzer's digital gel). For post-alignment reads, Picard tools may have a utility that does what you want.

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 13.6 years ago by Wen.Huang ★ 1.2k

Ram · Answer 5 · 2012-04-11

3

13.4 years ago

madk00k ▴ 370

You can use the Qualimap tool to calculate the insert size distribution from a BAM file.

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 13.4 years ago by madk00k ▴ 370

Ram · Answer 6 · 2012-09-12

2

13.0 years ago

Ketil 4.2k

A bit late to the party, but I wrote a small tool to generate various statistics on insert sizes. It does require you to map to a reference, but the reference doesn't have to be very good as you only need enough matches to get an estimate.

More info here: http://blog.malde.org/posts/bamstats.html

ADD COMMENT • link 13.0 years ago by Ketil 4.2k

1

One note: computing std.dev and so on may be greatly affected by outliers. That is why bwa first excludes outliers and then performs the calculation.

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 13.0 years ago by lh3 33k

1

Yes, I know. This was my primary motivation for doing quantiles instead, but I should probably do some outlier removal for the statistics. (Currently, my program streams over the data in linear time and constant space, and outlier removal will probably break that. :-( )

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 13.0 years ago by Ketil 4.2k

0

Hey, I tried your program (bamstats) and it seems to require a specific version of samtools. I couldn't get it installed. Is there any way I can test it out?

Thanks!

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.2 years ago by arnstrm ★ 1.9k