How Reliable Are Statistics Derived From "Subsampling" Of Paired-End Illumina Reads?
12.2 years ago

I have often seen people calculate various statistics, such as "percent of overlapping paired-end reads out of the total" or "average distance between paired ends", from just a subsample of the data.

I always wonder how reliable these statistics are. Take, for example, "percent of overlapping paired-end reads out of the total reads". Suppose you find that 10% of the pairs overlap. But that's just a subsample. If you take the entire sample, this number can drastically increase or decrease, right?
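
To put rough numbers on that worry: if the subsample is drawn at random from all pairs, the sampling error of a proportion depends mainly on the subsample size, not on the size of the full data set. The sketch below uses the textbook normal (Wald) approximation for a binomial proportion; the function name `proportion_interval` and the sample sizes are illustrative assumptions, not taken from any tool in this thread.

```python
import math


def proportion_interval(successes, n, z=1.96):
    """Approximate 95% normal (Wald) interval for a proportion.

    Assumes the n pairs were sampled at random from the full data set.
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Hypothetical case: 10% of the sampled pairs overlap, at three sample sizes.
for n in (1_000, 10_000, 100_000):
    p, low, high = proportion_interval(round(0.10 * n), n)
    print(f"n={n:>7}: estimate {p:.3f}, approx. 95% interval {low:.4f} to {high:.4f}")
```

Under these assumptions, 10% observed in 1,000 random pairs is uncertain by roughly plus or minus 1.9 percentage points, and by about 0.2 points at 100,000 pairs. A drastic change on the full data would therefore point to a non-random subsample rather than to sampling noise.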

12.2 years ago
Ketil 4.1k

You can test this yourself easily enough. For instance, my 'bamstats' program has an option -n to limit the number of reads processed (in order to speed things up). It is usually fast enough (a few minutes at most on large BAM files) that you can run it on the whole data set, so I use this option mostly for testing. The most obvious difference is that a lot of unmapped reads appear to come last (this is probably a sorted BAM file), so the unmapped fraction rises drastically when all data is included. In other words, the first N reads of a sorted file are not a random subsample, and statistics computed from them can be biased.

nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 1000 0207.bam 
## Input file: 0207.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                     469  93.80%  37993.2  3429.9    -0.1    -1.0
outies                       0   0.00%      NaN     NaN     NaN     NaN
lefties                      0   0.00%      NaN     NaN     NaN     NaN
righties                     0   0.00%      NaN     NaN     NaN     NaN

Total reads:     1000
 unmapped:         78 (7.8%)
 orphans:          78 (7.8%)
 split pairs:       0 (0.1%)

nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 10000 0207.bam
## Input file: 0207.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                    3607  72.14%  36874.7  3882.6     0.4    -0.7
outies                       0   0.00%      NaN     NaN     NaN     NaN
lefties                      0   0.00%      NaN     NaN     NaN     NaN
righties                     0   0.00%      NaN     NaN     NaN     NaN

Total reads:    10000
 unmapped:       1475 (14.8%)
 orphans:        1475 (14.8%)
 split pairs:      34 (0.7%)

nmd999X:..askell/bamstats % dist/build/bam/bam stats -n 100000 0207.bam
## Input file: 0207.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                   35250  70.50%  37166.6  3825.2     0.1    -0.0
outies                       0   0.00%      NaN     NaN     NaN     NaN
lefties                      0   0.00%      NaN     NaN     NaN     NaN
righties                     0   0.00%      NaN     NaN     NaN     NaN

Total reads:   100000
 unmapped:      14632 (14.6%)
 orphans:       14632 (14.6%)
 split pairs:     133 (0.3%)

nmd999X:..askell/bamstats % dist/build/bam/bam stats  0207.bam  
## Input file: 0207.bam
#Alignment               count     prop    mean   stdev    skew    kurt
innies                  161010  46.00%  37351.9  4099.5     0.1     0.1
outies                       0   0.00%      NaN     NaN     NaN     NaN
lefties                      6   0.00%  28524.8 27577.8     0.2    -1.9
righties                     0   0.00%      NaN     NaN     NaN     NaN

Total reads:   700106
 unmapped:     268467 (38.3%)
 orphans:      106222 (15.2%)
 split pairs:    1709 (0.5%)
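
The jump in the unmapped fraction above is what one would expect when the "subsample" is simply the first N records of a sorted file. As a minimal sketch of the difference, the toy simulation below compares taking the head of a file with drawing a uniform random sample in one pass (reservoir sampling). All numbers are made up for illustration and are not taken from 0207.bam; `reservoir_sample` is a hypothetical helper, not part of bamstats.

```python
import random


def unmapped_fraction(flags):
    """Fraction of reads flagged as unmapped (True)."""
    return sum(flags) / len(flags)


def reservoir_sample(items, k, rng):
    """Uniform random sample of k items in a single pass (Algorithm R)."""
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Toy "sorted" file (assumed layout): 54,000 reads of which 15% are
# unmapped mates scattered throughout, followed by 16,000 unmapped reads.
body = [(i % 20) < 3 for i in range(54_000)]
tail = [True] * 16_000
reads = body + tail

k = 5_000
rng = random.Random(0)  # fixed seed so the run is repeatable

truth = unmapped_fraction(reads)
head_est = unmapped_fraction(reads[:k])  # "first k reads"
random_est = unmapped_fraction(reservoir_sample(reads, k, rng))

print(f"whole file:     {truth:.3f}")
print(f"first {k} reads: {head_est:.3f}")
print(f"random {k}:      {random_est:.3f}")
```

In this toy setup the head of the file underestimates the unmapped fraction badly, while the random sample lands close to the whole-file value. For real BAM files, a random subsample is usually drawn with an existing tool rather than hand-written code.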

(I notice that the percentages are confusing: sometimes they are percent of pairs, and sometimes of reads. I guess I'll have to clarify that at some point...)
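
As a small worked check on the mixed units: judging only from the numbers printed above (not from the bamstats source, so treat this as an inference), the "count" for innies looks like a number of pairs, while the "prop" column matches twice that count divided by the total number of reads. The helper name `percent_of_reads` is hypothetical.

```python
def percent_of_reads(pair_count, total_reads):
    """Percent of all reads that belong to the given number of pairs."""
    return 100.0 * 2 * pair_count / total_reads


# (innies count, total reads) copied from the four runs above.
runs = [(469, 1_000), (3_607, 10_000), (35_250, 100_000), (161_010, 700_106)]
for pairs, reads in runs:
    print(f"{pairs:>7} pairs of {reads:>7} reads: {percent_of_reads(pairs, reads):.2f}%")
```

The results agree with the reported 93.80%, 72.14%, 70.50% and 46.00%, which suggests that column is a percentage of reads even though the count is in pairs.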
