The sam ouput of bowtie2 is normal
1
0
Entering edit mode
9.2 years ago
seta ★ 1.9k

Hi all,

I performed the mapping back of reads to a assembled transcriptome using bowtie2 with the default settings. I'm trying to get the average insert size using the command of

head -10000 mappings.sam | awk '{if ($9 > 0) {S+=$9; T+=1}}END{print "Mean: " S/T}'

But, it says that:

awk: cmd. line:1: (FILENAME=- FNR=10000) fatal: division by zero attempted.

Also, the head of sam output is like below:

@HD     VN:1.0  SO:unsorted
@SQ     SN:seq1 LN:797
@SQ     SN:seq2 LN:1925
@SQ     SN:seq3 LN:493
@SQ     SN:seq4 LN:1240
@SQ     SN:seq5 LN:2098
@SQ     SN:seq6 LN:2722
@SQ     SN:seq7 LN:1772
@SQ     SN:seq8 LN:2028
@SQ     SN:seq9 LN:2337

As far as I read I should look for the average insert size on the TLEN (9th) column, where is these columns?

Could you please let me know if this output is normal and how I can calculate the average insert size using them?

Thanks so much,

bowtie2 alignment BAM • 2.2k views
ADD COMMENT
1
Entering edit mode
9.2 years ago

remove the SAM header:

grep -v '^@' mappings.sam | head -1000 | awk '{if ($9 > 0) {S+=$9; T+=1}}END{print "Mean: " S/T}'

but you'd better use common tools like picard CollectInsertSizeMetrics

ADD COMMENT
0
Entering edit mode

Thanks Pierre and your suggestion for Picard. It gave a mean of 173.7 with standard deviation of 67. Could you please let me know if this is usual for PE library (100 bp) on Illumina Hiseq 2000 as I expected the longer size?

In your professional view, is it necessary to check the results with Picard?

ADD REPLY
1
Entering edit mode

The insert size depends on library prep; there is no "usual". Longer is generally better, but 173 is acceptable for most purposes for 2x100bp libraries. Not ideal though.

And, it's always necessary to quantify the insert size distribution using some method (such as mapping). Otherwise you have no way of determining whether library creation was successful, or how much real data you have. Bear in mind, though, that when you are mapping to such short reference sequences, your apparent insert size distribution will be biased toward shorter inserts. You can get a less-biased measure by mapping only to the long contigs. Assuming you are studying randomly-sheared DNA, of course.

ADD REPLY

Login before adding your answer.

Traffic: 3528 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6