Entering edit mode
8.3 years ago
Ric
▴
440
Hi, I found two following approaches how to estimate stdev and mean in paired-end:
First approach:
samtools view OC.bam | head -n 10000 | cut -f9 > initial.insertsizes.txt
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx]) # mean
[1] 345.9762
sd(a.v[a.v >= mn & a.v <= mx])
[1] 39.57006
Second approach:
samtools view OC.bam | head -n 10000 | awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'
n=5081, mean=345.141, stdev=70.5826
In both approaches the mean look the same, but stdev is different.
Which stdev is correct?
Thank you in advance.
Mic
I think the second way was trying to use Welford’s method
http://www.johndcook.com/blog/standard_deviation/
in such case it should be
can you try it?
I changed it to
awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((N*S2-S*S)/N(N-1))}'
, but I gotawk: (FILENAME=- FNR=10000) fatal: function
N' not defined`.Just put quotes around it
It did not work. Could you please provide the full command.
I still get
awk: (FILENAME=- FNR=200000) fatal: function
N' not defined` with the above command.Is this RNA or DNA? If its RNA, you are going to be thrown off by splicing - where reads are in different exons you'll get a much bigger insert size.
You often have read pairs with a very large insert size that bias the mean. Thus, you may want to consider using median and median absolute deviation - median() and mad() in R.
You are only sampling 10000 lines when there may be millions in this file.
Take a look at Qualimap. It can provide all sorts of addtional information about your bam files.
If you are able to go back to original R1/R2 files then you can use BBMerge from BBMap to estimate the insert sizes like described here.
BBMerge provides only Avg Insert and not mean and standard deviation is different to
CollectInsertSizeMetrics
.