I am analysing genome features using genome_structural_correction/block_bootstrap.py from http://www.encodestatistics.org/ to assess the correlation between regions resulting from my analysis and hg19 repeatmasker regions downloaded from UCSC.
For my result regions, I have a bed file containing 12739 regions between 1000 and ~5000 bp each, which encompass about 32Mb of hg19:
cat diff.median.g20.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
32193400
Compared to the size of hg19, I calculate the r value of my dataset to use for block_bootstrap.py:
cat hg19.extents.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
3095693983
echo "32193400/3095693983" | bc -l
.01039941291897391009
.0104 # rounding up
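For reference, the interval-length sum works because BED coordinates are 0-based and half-open, so each interval contributes $3 - $2 bases. A toy check of the one-liner (hypothetical /tmp/toy.bed stands in for the real file):

```shell
# Toy check of the interval-length sum: two intervals of 500 bp each
printf 'chr1\t100\t600\nchr2\t0\t500\n' > /tmp/toy.bed
awk -F'\t' '{sum += $3 - $2} END {print sum}' /tmp/toy.bed
# prints 1000
```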
I run block_bootstrap.py between my result regions and a repeatmasker bed file using the binary-only (-B) option:
block_bootstrap.py -1 /tmp/hg19.rmsk.repclass.bed -2 /tmp/diff.median.g20.bed -d hg19.extents.bed -B -r 0.0104 -n 20 -v
And in parallel I shuffle my result regions like this:
shuffleBed -i /tmp/diff.median.g20.bed -g hg19.genome > /tmp/my_regions_shuffled.bed
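In case it helps anyone reproducing this: the hg19.genome file that shuffleBed takes is a chrom<TAB>size table, and it can be derived from the extents BED, assuming (my assumption) that hg19.extents.bed has one "chrom<TAB>0<TAB>length" line per chromosome:

```shell
# Hedged sketch: build the chrom-size .genome file bedtools expects from an
# extents BED; toy data stands in for hg19.extents.bed
printf 'chr1\t0\t1000\nchr2\t0\t500\n' > /tmp/toy.extents.bed
awk -F'\t' 'BEGIN{OFS="\t"}{print $1, $3}' /tmp/toy.extents.bed
# chr1    1000
# chr2    500
```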
And run block_bootstrap.py between the shuffled regions and repeatmasker:
block_bootstrap.py -1 /tmp/hg19.rmsk.repclass.bed -2 /tmp/my_regions_shuffled.bed -d /illumina/development/curium/misc/hg19.extents.bed -B -r 0.0104 -n 20 -v
Now if I look at the results for both, I get a p-value of 0.0 in each case, and although the Z score for my result regions (1909.23333225) is higher than for the shuffled regions (1512.34240055), the difference is not huge:
diff.median.g20.bed my_regions_shuffled.bed
Input Files Parse Time: 407.926077843 Input Files Parse Time: 406.489609957
#### Samples
Sample # Stat Sample # Stat
1 0.00384960246773 1 0.016744983716
2 0.00703107250599 2 0.00764473232889
3 0.00495174606857 3 0.00763668446088
4 0.00830551216659 4 0.010381377025
5 0.00651359894589 5 0.00786252841803
6 0.00694955964229 6 0.00867184812393
7 0.00785941293248 7 0.0128478008673
8 0.00300208131406 8 0.00965871219167
9 0.0161754860882 9 0.0161625764828
10 0.00612241385575 10 0.0140784379638
11 0.00758022290081 11 0.00923275700507
12 0.00914677596129 12 0.00795006362919
13 0.0115290340203 13 0.0136765964466
14 0.0171499229021 14 0.0107164633663
15 0.0106131396121 15 0.00794190162492
16 0.00442453871231 16 0.0112743934613
17 0.0106013666295 17 0.00781952621329
18 0.0116470401228 18 0.00664186647169
19 0.0107669804397 19 0.00777346664384
20 0.0131679192161 20 0.0122719061291
Sample Dist Info: Sample Dist Info:
The expected mean under the NULL: 0.0254927935168 The expected mean under the NULL: 0.0103585013634
The sample mean: 0.00886937132522 The sample mean: 0.0103494311285
The normalized bootstrap SD: 0.000383492258427 The normalized bootstrap SD: 0.000301086586107
The real stat: 0.757668995964 The real stat: 0.465704511768
The scaled real stat: 0.732176202447 The scaled real stat: 0.455346010405
Z Score: 1909.23333225 Z Score: 1512.34240055
p-value: 0.0 p-value: 0.0
Execution Time: 20.6973369122 Execution Time: 28.0755569935
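For what it's worth, the reported Z scores can be reproduced from the other numbers in the output as (real stat − expected mean under the NULL) / normalized bootstrap SD:

```shell
# Z = (real stat - expected mean under the NULL) / normalized bootstrap SD
awk 'BEGIN{printf "%.2f\n", (0.757668995964 - 0.0254927935168)/0.000383492258427}'
# 1909.23
awk 'BEGIN{printf "%.2f\n", (0.465704511768 - 0.0103585013634)/0.000301086586107}'
# 1512.34
```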
I was expecting the result of the test with shuffled regions not to be significant, but it is. Any ideas what might be happening here? Could this be a bug in the script?
Did you check the README? http://www.encodestatistics.org/websvn/filedetails.php?repname=genome_structural_correction&path=%2Fpython_encode_statistics%2Ftrunk%2FREADME.txt
I did, what am I missing from it?
Are you sure about your input files? Besides, testing for correlations requires the --test cc option (bc is the default). I didn't understand your r calculation. Is your features' mean distance around 0.01 * 2500 bp? It seems that you're testing pretty small features in pretty small blocks.
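To check that, you can estimate the mean feature length and the mean gap between consecutive same-chromosome features directly from a coordinate-sorted BED; a toy sketch (file name and data hypothetical):

```shell
# Mean feature length and mean same-chromosome gap from a sorted BED (toy data)
printf 'chr1\t100\t600\nchr1\t1600\t2100\nchr1\t4100\t4600\n' > /tmp/toy_sorted.bed
awk -F'\t' '{len += $3 - $2; n++}
     prev_chr == $1 {gap += $2 - prev_end; g++}
     {prev_chr = $1; prev_end = $3}
     END {printf "mean_length=%d mean_gap=%d\n", len/n, gap/g}' /tmp/toy_sorted.bed
# mean_length=500 mean_gap=1500
```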
Cool, will try your suggestions. I think I misunderstood what -r should be. How do I guess what -r should be? How do I know the mixing distance of features -1 and -2?

A bit more info: I ran a parameter sweep on --region_fraction with a pair of shuffled beds, and I can't really explain the relationship between --region_fraction and the resulting Z scores and p-values... More troubling, some --region_fraction values just fail. Here are a few:

You're using too few samples (-n 20 is just for testing). The relationship is described in the GSC draft starting at page 26. What kind of feature are you testing?
Ok, the same loop with -n 1000 does give meaningful values. It sometimes crashes though with this error:

Are you using numpy? It looks like a numerical exception (a value too big to be held by your int/long type). Is your machine 64-bit? Can you edit your comment to put the error inside a code block? It's hard to read it like this. Maybe a pastebin link would be a better option.
I am using a 64-bit machine. How can I know if the script is using numpy? Is numpy a dependency of the script?
It's not a dependency, but the script will use it if available. It seems that statistical_functions.py (required by base_types.py) is running without numpy. Try installing numpy (or check whether it's installed). Numpy has better numerical type support and should get rid of this error.
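A quick way to check whether numpy is importable by the interpreter you launch the script with (substitute your actual python binary):

```shell
# Does the interpreter that runs block_bootstrap.py see numpy?
python -c 'import numpy; print(numpy.__version__)' 2>/dev/null \
  || echo "numpy not importable by this interpreter"
```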
Numpy is available for the version of python I am using:
This script:
Gives me:
Nah! I looked at the __new__ function. It expects sequence coords. Again, are you sure about your data files? No negative coords? No numbers too big to be represented by a long? 'Cause now it really seems to be a bug . . .
I hope my bed files are sane. For the sake of reproducibility, I've placed them here:
https://www.dropbox.com/sh/9hej91hs9uquvbt/Je83Vc7TL0/upload
Your bed files look OK. I wasn't able to reproduce your error (there is no my_regions_shuffled2.bed in your link), so I made one of my own. Comparing your diff file against its shuffled version using hg19.rmsk gives meaningful results for a broad range of r values (all slightly negative Z scores). So everything seems to work as expected. I suppose you're on a Mac with an old Python version.
So should I use -r 0.01, or will just any value do? I am using CentOS with Python 2.7.3. The my_regions_shuffled2.bed should just be another shuffling, like my_regions_shuffled.bed. So what Z-score/p-value do you get with the shuffled bed?

This is similar to a question I asked a while back (and never really found an answer for...): Need advice on the correct usage for the Genome Structure Correction statistical software from the ENCODE project