Question

How Reliable Are Statistics Derived From "Subsampling" Of Paired-End Illumina Reads?



12.2 years ago

thecuriousbiologist ▴ 550

I have often seen people calculate various statistics, such as "percent of overlapping paired-end reads out of the total" or "average distance between paired ends", from just a subsample of the data.

I always wonder how reliable these statistics are. Take "percent of overlapping paired-end reads out of the total reads", for example. Suppose you find that 10% of the pairs overlap. But that's just a subsample. If you take the entire sample, this number can drastically increase or decrease, right?

illumina • 3.1k views

updated 4.9 years ago by Biostar 20 • written 12.2 years ago by thecuriousbiologist ▴ 550

score 1 · Answer 1 · 2012-09-25

You can test this yourself easily enough. For instance, my 'bamstats' program has an option -n that limits the number of reads processed (in order to speed things up). It is usually fast enough (a few minutes at most on large BAM files) that you can simply run it on the whole data set, so I use this option mostly for testing. In the runs below, the most obvious difference is that a lot of unmapped reads appear to come last (this is probably a sorted BAM file), so the unmapped fraction rises drastically when all data is included: from 7.8% with -n 1000 to 38.3% for the full file.

nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 1000 0207.bam 
## Input file: 0207.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                     469  93.80%  37993.2  3429.9    -0.1    -1.0
outies                       0   0.00%      NaN     NaN     NaN     NaN
lefties                      0   0.00%      NaN     NaN     NaN     NaN
righties                     0   0.00%      NaN     NaN     NaN     NaN

Total reads:     1000
 unmapped:         78 (7.8%)
 orphans:          78 (7.8%)
 split pairs:       0 (0.1%)

nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 10000 0207.bam
## Input file: 0207.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                    3607  72.14%  36874.7  3882.6     0.4    -0.7
outies                       0   0.00%      NaN     NaN     NaN     NaN
lefties                      0   0.00%      NaN     NaN     NaN     NaN
righties                     0   0.00%      NaN     NaN     NaN     NaN

Total reads:    10000
 unmapped:       1475 (14.8%)
 orphans:        1475 (14.8%)
 split pairs:      34 (0.7%)

nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 100000 0207.bam
## Input file: 0207.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                   35250  70.50%  37166.6  3825.2     0.1    -0.0
outies                       0   0.00%      NaN     NaN     NaN     NaN
lefties                      0   0.00%      NaN     NaN     NaN     NaN
righties                     0   0.00%      NaN     NaN     NaN     NaN

Total reads:   100000
 unmapped:      14632 (14.6%)
 orphans:       14632 (14.6%)
 split pairs:     133 (0.3%)

nmd999X:..askell/bamstats % dist/build/bam/bam stats  0207.bam  
## Input file: 0207.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                  161010  46.00%  37351.9  4099.5     0.1     0.1
outies                       0   0.00%      NaN     NaN     NaN     NaN
lefties                      6   0.00%  28524.8 27577.8     0.2    -1.9
righties                     0   0.00%      NaN     NaN     NaN     NaN

Total reads:   700106
 unmapped:     268467 (38.3%)
 orphans:      106222 (15.2%)
 split pairs:    1709 (0.5%)
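
The jump in the unmapped fraction above looks like an ordering effect rather than ordinary sampling noise. Here is a minimal Python sketch of that distinction, on invented toy data (not the contents of 0207.bam). It assumes that -n takes reads from the start of the file, which the output suggests but which I have not stated above; the helper names (`make_toy_reads`, `first_n`, `random_n`, `proportion_with_se`) are hypothetical and not part of bamstats.

```python
import math
import random


def proportion_with_se(flags):
    """Return (p, standard error) for a sequence of 0/1 flags.

    The standard error sqrt(p * (1 - p) / n) is only meaningful if the
    flags are a random sample. It ignores the finite-population
    correction, so it is slightly conservative here.
    """
    n = len(flags)
    p = sum(flags) / n
    return p, math.sqrt(p * (1.0 - p) / n)


def make_toy_reads(total=70_000, unmapped_tail=20_000, rate=0.15, seed=1):
    """Toy stand-in for a sorted file: 1 = unmapped read, 0 = mapped read.

    Assumption for illustration: a background unmapped rate in the sorted
    part, followed by a block of unmapped reads at the end.
    """
    rng = random.Random(seed)
    head = [1 if rng.random() < rate else 0 for _ in range(total - unmapped_tail)]
    return head + [1] * unmapped_tail


def first_n(reads, n):
    """Subsample by taking the first n records (order-dependent)."""
    return reads[:n]


def random_n(reads, n, seed=0):
    """Subsample n records uniformly at random, without replacement."""
    return random.Random(seed).sample(reads, n)


if __name__ == "__main__":
    reads = make_toy_reads()
    true_p, _ = proportion_with_se(reads)
    print(f"whole file: unmapped = {true_p:.1%}")
    for n in (1_000, 10_000):
        p_head, _ = proportion_with_se(first_n(reads, n))
        p_rand, se = proportion_with_se(random_n(reads, n))
        print(f"n={n:>6}: first-n = {p_head:.1%}, "
              f"random = {p_rand:.1%} +/- {1.96 * se:.1%}")
```

In this toy setup the first-n estimate stays near the background rate however large n gets, while the random subsample lands close to the whole-file value with an error that shrinks roughly as 1/sqrt(n). So a subsample statistic is only as trustworthy as the way the subsample was drawn.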