I have often seen people calculating various statistics, such as "percent of overlapping paired-end reads out of the total" or "average distance between paired ends", from just a subsample of the data.
I always wonder how reliable these statistics are. Take "percent of overlapping paired-end reads out of the total reads", for example. Suppose you find that 10% of the pairs overlap, but that's just in a subsample. If you take the entire sample, this number could increase or decrease drastically, right?
You can test this yourself easily enough. For instance, my 'bamstats' program has an option -n to select a limited number of reads (in order to speed things up). Of course, it is usually fast enough (a few minutes at most on large BAM files) that you can easily run it on the whole data set, so I use this option mostly for testing. In the output below, the most obvious difference is in the unmapped reads: a lot of them appear to come last (this is probably a sorted BAM file), so that statistic rises drastically when all data is included, from 7.8% for the first 1000 reads to 38.3% for the whole file.
nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 1000 0207.bam
## Input file: 0207.bam
#Alignment count prop mean stdev skew kurt
innies 469 93.80% 37993.2 3429.9 -0.1 -1.0
outies 0 0.00% NaN NaN NaN NaN
lefties 0 0.00% NaN NaN NaN NaN
righties 0 0.00% NaN NaN NaN NaN
Total reads: 1000
unmapped: 78 (7.8%)
orphans: 78 (7.8%)
split pairs: 0 (0.1%)
nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 10000 0207.bam
## Input file: 0207.bam
#Alignment count prop mean stdev skew kurt
innies 3607 72.14% 36874.7 3882.6 0.4 -0.7
outies 0 0.00% NaN NaN NaN NaN
lefties 0 0.00% NaN NaN NaN NaN
righties 0 0.00% NaN NaN NaN NaN
Total reads: 10000
unmapped: 1475 (14.8%)
orphans: 1475 (14.8%)
split pairs: 34 (0.7%)
nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 100000 0207.bam
## Input file: 0207.bam
#Alignment count prop mean stdev skew kurt
innies 35250 70.50% 37166.6 3825.2 0.1 -0.0
outies 0 0.00% NaN NaN NaN NaN
lefties 0 0.00% NaN NaN NaN NaN
righties 0 0.00% NaN NaN NaN NaN
Total reads: 100000
unmapped: 14632 (14.6%)
orphans: 14632 (14.6%)
split pairs: 133 (0.3%)
nmd999X:..askell/bamstats % dist/build/bam/bam stats 0207.bam
## Input file: 0207.bam
#Alignment count prop mean stdev skew kurt
innies 161010 46.00% 37351.9 4099.5 0.1 0.1
outies 0 0.00% NaN NaN NaN NaN
lefties 6 0.00% 28524.8 27577.8 0.2 -1.9
righties 0 0.00% NaN NaN NaN NaN
Total reads: 700106
unmapped: 268467 (38.3%)
orphans: 106222 (15.2%)
split pairs: 1709 (0.5%)
(I notice that the percentages are confusing: sometimes they are a percentage of pairs, and sometimes of reads. I guess I'll have to clarify that at some point...)
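To put a rough number on "how reliable", here is a minimal Python sketch (not part of bamstats) that computes the usual normal-approximation interval for a proportion, using the unmapped counts from the runs above. The function name `proportion_ci` is made up for illustration, and the interval is only meaningful if the subsample is drawn at random from the file. The sketch assumes that -n simply takes the first N reads, which is what the output suggests but is not something I have verified.

```python
import math

def proportion_ci(hits, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    Only valid if the n reads are a random sample of the file.
    Returns (estimate, lower bound, upper bound).
    """
    p = hits / n
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# (reads examined, unmapped reads), copied from the -n runs above
runs = [(1000, 78), (10000, 1475), (100000, 14632)]

# Unmapped fraction over the whole file, from the final run
full_p = 268467 / 700106

for n, hits in runs:
    p, lo, hi = proportion_ci(hits, n)
    covered = lo <= full_p <= hi
    print(f"n={n:>6}: {p:.3f} [{lo:.3f}, {hi:.3f}] "
          f"contains whole-file value {full_p:.3f}? {covered}")
```

None of the three intervals contains the whole-file value, even though the interval at 100000 reads is only a fraction of a percentage point wide. So the problem here is not the size of the subsample but the way it is taken: reads from the start of a sorted file are not representative. With a genuinely random subsample of the same size, the interval would be a reasonable guide to how far the whole-file number could be from the estimate.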