Estimating Insert Size From Paired End Data.

4

11.1 years ago

GouthamAtla 12k

Hi,

I have paired-end data from Illumina. To estimate the insert size in silico (from scratch), I aligned the two read files to the mouse genome separately, as single-end reads, so I now have two alignment files (SAM/BAM). I would like to estimate the insert size (as a distribution plot) from these two files.

The existing tools rely on the "=" in the RNEXT field of the SAM format to identify the corresponding mate, rather than on the read name. Is anyone aware of a tool that estimates the insert size by matching read names across the two SAM files?

Thanks

picard alignment paired-end • 37k views

ADD COMMENT • link updated 18 months ago by Ram 45k • written 11.1 years ago by GouthamAtla 12k

0

I have written a script that calculates insert size from single end alignments. Thanks all for your help.

ADD REPLY • link 11.0 years ago by GouthamAtla 12k

10

11.1 years ago

Ashutosh Pandey 12k

Estimating insert size the easy way:

1. Align the paired-end data jointly against the reference genome, that is, provide both FASTQ files to the aligner in a single run.
2. Extract the ninth column of the SAM file (TLEN). TLEN is signed, so take its absolute value or keep only the positive value of each pair. To be more accurate, use only read pairs that are uniquely aligned to the genome.
3. Calculate the mean of the distribution of TLEN values obtained in step 2.
4. Subtract the length of a read (for example, 75 bp or 100 bp) from the mean to get the insert size.
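
The extraction and averaging steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mean_tlen` is made up for this example, and a MAPQ cutoff of 20 stands in for "uniquely aligned", which is an approximation rather than part of the original recipe. Only positive TLEN values are kept, so each pair is counted once.

```python
import statistics

def mean_tlen(sam_lines, min_mapq=20):
    """Return (mean, stdev, n) of positive TLEN values from SAM text records.

    min_mapq is an assumed proxy for "uniquely aligned".
    """
    tlens = []
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag, mapq, tlen = int(fields[1]), int(fields[4]), int(fields[8])
        # skip secondary (0x100) and supplementary (0x800) records,
        # and anything not flagged as a proper pair (0x2)
        if (flag & 0x900) or not (flag & 0x2):
            continue
        # TLEN is signed: the leftmost mate carries the positive value
        if mapq < min_mapq or tlen <= 0:
            continue
        tlens.append(tlen)
    if not tlens:
        raise ValueError("no usable read pairs found")
    stdev = statistics.stdev(tlens) if len(tlens) > 1 else 0.0
    return statistics.mean(tlens), stdev, len(tlens)
```

It could be called on an open file, for example `mean_tlen(open("aligned.sam"))`, where `aligned.sam` is a placeholder name. Note that TLEN, as most aligners report it, spans from the leftmost to the rightmost aligned base of the pair and therefore already includes both reads; whether anything should be subtracted depends on which definition of insert size is being used.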

With aligners like BWA, you do not need to supply the insert size; it is estimated automatically. The following paragraph from the BWA manual explains how:

Estimating Insert Size Distribution

BWA estimates the insert size distribution per 256*1024 read pairs. It first collects pairs of reads with both ends mapped with a single-end quality 20 or higher and then calculates median (Q2), lower and higher quartile (Q1 and Q3). It estimates the mean and the variance of the insert size distribution from pairs whose insert sizes are within interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair considered to be properly paired (SAM flag 0x2) is calculated by solving equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the standard error of the insert size distribution, L is the length of the genome, p0 is prior of anomalous pair and Phi() is the standard cumulative distribution function. For mapping Illumina short-insert reads to the human genome, x is about 6-7 sigma away from the mean. Quartiles, mean, variance and x will be printed to the standard error output.
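
The quartile-based trimming described in the manual excerpt can be illustrated with a short Python sketch. It is not BWA's code: the function name `robust_insert_stats` is hypothetical, the quartiles come from Python's `statistics.quantiles` (Python 3.8+), which may differ slightly from how BWA computes them, and the step that solves for the maximum distance x is left out.

```python
import statistics

def robust_insert_stats(sizes):
    """Mean and standard deviation of insert sizes after quartile-based trimming.

    Values outside [Q1 - 2*(Q3-Q1), Q3 + 2*(Q3-Q1)] are treated as outliers.
    """
    q1, q2, q3 = statistics.quantiles(sizes, n=4)   # lower quartile, median, upper quartile
    iqr = q3 - q1
    low, high = q1 - 2 * iqr, q3 + 2 * iqr
    kept = [s for s in sizes if low <= s <= high]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "low": low, "high": high,
        "n_kept": len(kept),
        "mean": statistics.mean(kept),
        "stdev": statistics.stdev(kept) if len(kept) > 1 else 0.0,
    }
```

The point of the trimming is that a handful of chimeric or mis-mapped pairs with very large apparent insert sizes would otherwise dominate the mean and variance.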

Estimating insert size your way (the hard way):

1. Sort both SAM files by queryname.
2. Remove secondary alignments so that the read order and read indexes are the same in both SAM files.
3. For every read pair, calculate the absolute difference between the two mapping positions (ignore pairs whose reads map to different chromosomes).
4. Take the mean of the distribution and subtract the read length.
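
A rough Python sketch of this approach, matching mates by read name across two single-end SAM files. It departs from the steps above in one respect: instead of sorting both files by queryname, it keeps the alignments in dictionaries, which is simpler but holds them in memory. All function names are illustrative, and the suffix handling (`/1`, `/2`, `_R1`, `_R2`) is an assumption about how the mates are named. It reports the outer span of each pair, from the leftmost to the rightmost aligned base.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def ref_span(cigar):
    """Reference bases covered by an alignment (M, D, N, = and X consume reference)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")

def load_primary(sam_lines):
    """Map read name -> (chrom, start, end), 1-based inclusive, for mapped primary alignments."""
    hits = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        flag = int(f[1])
        if flag & 0x4 or flag & 0x900:     # unmapped, secondary or supplementary
            continue
        name = re.sub(r"(/|_R)[12]$", "", f[0])   # assumed mate suffixes
        start = int(f[3])
        hits[name] = (f[2], start, start + ref_span(f[5]) - 1)
    return hits

def fragment_lengths(sam1_lines, sam2_lines):
    """Outer span of each pair whose mates map to the same chromosome."""
    first, second = load_primary(sam1_lines), load_primary(sam2_lines)
    lengths = []
    for name in sorted(first.keys() & second.keys()):
        chrom1, s1, e1 = first[name]
        chrom2, s2, e2 = second[name]
        if chrom1 != chrom2:               # ignore pairs split across chromosomes
            continue
        lengths.append(max(e1, e2) - min(s1, s2) + 1)
    return lengths
```

Taking the outer span rather than a plain difference of start positions avoids negative values and does not depend on which mate maps further left.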

ADD COMMENT • link updated 5.2 years ago by Ram 45k • written 11.1 years ago by Ashutosh Pandey 12k

0

Ashutosh Pandey, thanks. I have an idea of how to do it; I am just looking for an existing tool.

ADD REPLY • link 11.1 years ago by GouthamAtla 12k

0

I apologize, I thought you were a newbie. Give http://picard.sourceforge.net/command-line-overview.shtml#MergeBamAlignment a try; it may work for you. That is, it may update the TLEN values for your alignments in the new BAM file.

ADD REPLY • link updated 18 months ago by Ram 45k • written 11.1 years ago by Ashutosh Pandey 12k

0

Dear Ashutosh,

When I take the two reads of a pair from the two files, the BED records look like this:

68331263        68331364        HWI-1KL120:91:C0CBKACXX:1:1101:1420:2186_R1        42      +
68331437        68331536        HWI-1KL120:91:C0CBKACXX:1:1101:1420:2186_R2        42      -

If I compute R1 position - R2 position, I get a negative value. Does this have anything to do with the strand information?

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 11.1 years ago by GouthamAtla 12k
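
On the negative value in the BED example above: the sign of a plain difference only reflects which mate lies further left on the reference, and for a forward/reverse pair the + strand read is normally the left one. A small worked example, assuming standard BED coordinates (0-based start, exclusive end); the helper name `outer_span` is made up for illustration:

```python
def outer_span(r1, r2):
    """Distance from the leftmost start to the rightmost end of two BED intervals."""
    (s1, e1), (s2, e2) = r1, r2
    return max(e1, e2) - min(s1, s2)

r1 = (68331263, 68331364)   # ..._R1, + strand
r2 = (68331437, 68331536)   # ..._R2, - strand

print(r1[0] - r2[0])        # -174: negative only because R1 is the left mate
print(outer_span(r1, r2))   # 273: span covering both reads
```

Using the outer coordinates gives the same result regardless of which file a read came from or which strand it maps to.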

0

Why do you subtract the read length in the fourth step? Does the insert size include the read length? Insert Size And Fragment Size ?

ADD REPLY • link 7.7 years ago by verne91 ▴ 20

1

11.1 years ago

Irsan ★ 7.8k

Picard tools CollectInsertSizeMetrics

ADD COMMENT • link 11.1 years ago by Irsan ★ 7.8k

0

Picard's CollectInsertSizeMetrics takes only one BAM file as input, and it gives no output even if we merge the two files. The main question is how the tool decides whether two reads are paired: is it based on read names or on the "=" field in the SAM file?

ADD REPLY • link updated 18 months ago by Ram 45k • written 11.1 years ago by GouthamAtla 12k

0

I have never used this feature of Picard, but I think MergeBamAlignment should do the job. It can take separate BAM files, one for the first read of each pair and one for the second. http://picard.sourceforge.net/command-line-overview.shtml#MergeBamAlignment

ADD REPLY • link updated 18 months ago by Ram 45k • written 11.1 years ago by Ashutosh Pandey 12k

0

You may be using MergeSamFiles. I don't think it will update or calculate the TLEN value (ninth column) in the new BAM file.

ADD REPLY • link updated 18 months ago by Ram 45k • written 11.1 years ago by Ashutosh Pandey 12k

1

10.8 years ago

Rayan Chikhi ★ 1.6k

This script gives a very quick insert size estimate (based on BWA's intermediate alignment results):

#!/usr/bin/env python3
doc = """
Quickly estimates insert sizes of read datasets, given some sequence(s) they can be mapped to.
Author: Rayan Chikhi
short usage: <reference> <*.fastq>
example:
         estimate-insert-sizes contigs.fa readsA_1.fq readsA_2.fq readsB_1.fq readsB_2.fq
or, with shell globbing:
         estimate-insert-sizes contigs.fa *.fq
special case, a single argument is interpreted as interleaved pairs:
         estimate-insert-sizes contigs.fa interleaved.fq
"""
""" technical note:
 by default, bwa will be executed with "-t X" to read X*100kbp sequences, instead of just 100kbp.
 100kbp is not enough in my experience to detect insert sizes.
 incidentally, bwa will use X threads, even if the cpu has fewer cores than that.
 X can be changed by modifying this variable:
"""
nb_threads = 5
# --------
import os
import subprocess
import sys

if len(sys.argv) < 3:
    sys.exit(doc)
reference = sys.argv[1]
reads = sorted(sys.argv[2:])
# check that bwa can be launched at all
try:
    subprocess.call(["bwa"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
except OSError:
    sys.exit("Please make sure that the `bwa` binary is in your $PATH")
for read in reads:
    if not os.path.isfile(read):
        sys.exit("Error: %s does not exist" % read)
if len(reads) == 1:
    print("Assuming that %s is interleaved" % reads[0])
    reads += [""]
if not os.path.isfile(reference + ".sa"):
    print("Creating index file..")
    subprocess.call(["bwa", "index", reference], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

def parse_list(line, nb_elts):
    # specific to BWA-MEM stderr format: the last nb_elts tokens look like "(a, b, ...)"
    tokens = " ".join(line.strip().replace(",", "").split()[-nb_elts:])
    return [int(float(x)) for x in tokens[1:-1].split()]

for read1, read2 in zip(reads[::2], reads[1::2]):
    print("Processing: \n %s \n %s " % (read1, read2))
    stats = dict()  # reset for each library
    orientation = None
    # "-p" marks read1 as interleaved; the empty read2 placeholder is not passed to bwa
    cmd = ["bwa", "mem"] + (["-p"] if read2 == "" else []) + ["-t", str(nb_threads), reference, read1]
    if read2:
        cmd.append(read2)
    # alignments on stdout are discarded; only the stderr log is parsed, as text
    process = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, universal_newlines=True)
    seen_candidate_line = False
    while True:
        line = process.stderr.readline()
        if line == "" and process.poll() is not None:
            break
        if "worker" in line:
            break
        if "pestat" not in line:
            continue
        if "candidate unique pairs for" in line:
            if seen_candidate_line:
                # second batch of reads: the first batch is enough
                break
            seen_candidate_line = True
            nb_pairs = parse_list(line, 4)
            for i, name in enumerate(["FF", "FR", "RF", "RR"]):
                stats[name] = {"nb_pairs": nb_pairs[i]}
        if "orientation" in line:
            orientation = line.strip().split()[-1].replace(".", "")
        if "mem_pestat] mean and std.dev:" in line:
            mean, stdev = parse_list(line, 2)
            stats[orientation]["mean"] = mean
            stats[orientation]["stdev"] = stdev
            if orientation == "RR":
                # stats are complete
                break
        sys.stdout.write(line)
        sys.stdout.flush()
    if process.poll() is None:
        process.terminate()
    if not stats:
        sys.exit("Error: no insert size statistics found in the bwa output")
    # report the orientation supported by the largest number of pairs
    results = sorted(stats.items(), key=lambda x: x[1]["nb_pairs"], reverse=True)
    most_likely = results[0]
    mean = most_likely[1]["mean"]
    stdev = most_likely[1]["stdev"]
    print("Orientation", most_likely[0], "mean", mean, "stdev", stdev)

ADD COMMENT • link updated 5.2 years ago by Ram 45k • written 10.8 years ago by Rayan Chikhi ★ 1.6k

0

7.5 years ago

jmodlis • 0

I know this post is really old, but if anyone is still looking for an answer (like I was): if you use SOAPdenovo2, the average insert size is estimated during the assembly and written to the log file, along with the standard deviation. The estimate will change slightly depending on the assembly, but it should at least give you an idea.

ADD COMMENT • link 7.5 years ago by jmodlis • 0
