Entering edit mode
9.2 years ago
seta
★
1.9k
Hi all,
I performed the mapping back of reads to a assembled transcriptome using bowtie2 with the default settings. I'm trying to get the average insert size using the command of
head -10000 mappings.sam | awk '{if ($9 > 0) {S+=$9; T+=1}}END{print "Mean: " S/T}'
But, it says that:
awk: cmd. line:1: (FILENAME=- FNR=10000) fatal: division by zero attempted.
Also, the head of sam output is like below:
@HD VN:1.0 SO:unsorted
@SQ SN:seq1 LN:797
@SQ SN:seq2 LN:1925
@SQ SN:seq3 LN:493
@SQ SN:seq4 LN:1240
@SQ SN:seq5 LN:2098
@SQ SN:seq6 LN:2722
@SQ SN:seq7 LN:1772
@SQ SN:seq8 LN:2028
@SQ SN:seq9 LN:2337
As far as I read I should look for the average insert size on the TLEN (9th) column, where is these columns?
Could you please let me know if this output is normal and how I can calculate the average insert size using them?
Thanks so much,
Thanks Pierre and your suggestion for Picard. It gave a mean of 173.7 with standard deviation of 67. Could you please let me know if this is usual for PE library (100 bp) on Illumina Hiseq 2000 as I expected the longer size?
In your professional view, is it necessary to check the results with Picard?
The insert size depends on library prep; there is no "usual". Longer is generally better, but 173 is acceptable for most purposes for 2x100bp libraries. Not ideal though.
And, it's always necessary to quantify the insert size distribution using some method (such as mapping). Otherwise you have no way of determining whether library creation was successful, or how much real data you have. Bear in mind, though, that when you are mapping to such short reference sequences, your apparent insert size distribution will be biased toward shorter inserts. You can get a less-biased measure by mapping only to the long contigs. Assuming you are studying randomly-sheared DNA, of course.