I am analysing genome features using genome_structural_correction/block_bootstrap.py from http://www.encodestatistics.org/ to assess the correlation between regions resulting from my analysis and hg19 repeatmasker regions downloaded from UCSC.
For my result regions, I have a bed file containing 12739 regions between 1000 and ~5000 bp each, which encompass about 32Mb of hg19:
cat diff.median.g20.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
32193400
Compared to the size of hg19, I calculate the r value of my dataset to use for block_bootstrap.py:
cat hg19.extents.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
3095693983
echo "32193400/3095693983" | bc -l
.01039941291897391009
.0104 # rounding up
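For reference, the interval-length sum works because BED coordinates are 0-based and half-open, so each interval contributes $3 - $2 bases. A toy check of the one-liner (hypothetical /tmp/toy.bed stands in for the real file):

```shell
# Toy check of the interval-length sum: two intervals of 500 bp each
printf 'chr1\t100\t600\nchr2\t0\t500\n' > /tmp/toy.bed
awk -F'\t' '{sum += $3 - $2} END {print sum}' /tmp/toy.bed
# prints 1000
```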
I run block_bootstrap.py between my result regions and a repeatmasker bed file using the binary-only (-B) option:
block_bootstrap.py -1 /tmp/hg19.rmsk.repclass.bed -2 /tmp/diff.median.g20.bed -d hg19.extents.bed -B -r 0.0104 -n 20 -v
And in parallel I shuffle my result regions like this:
shuffleBed -i /tmp/diff.median.g20.bed -g hg19.genome > /tmp/my_regions_shuffled.bed
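In case it helps anyone reproducing this: the hg19.genome file that shuffleBed takes is a chrom<TAB>size table, and it can be derived from the extents BED, assuming (my assumption) that hg19.extents.bed has one "chrom<TAB>0<TAB>length" line per chromosome:

```shell
# Hedged sketch: build the chrom-size .genome file bedtools expects from an
# extents BED; toy data stands in for hg19.extents.bed
printf 'chr1\t0\t1000\nchr2\t0\t500\n' > /tmp/toy.extents.bed
awk -F'\t' 'BEGIN{OFS="\t"}{print $1, $3}' /tmp/toy.extents.bed
# chr1    1000
# chr2    500
```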
And run block_bootstrap.py between the shuffled regions and repeatmasker:
block_bootstrap.py -1 /tmp/hg19.rmsk.repclass.bed -2 /tmp/my_regions_shuffled.bed -d /illumina/development/curium/misc/hg19.extents.bed -B -r 0.0104 -n 20 -v
Now if I look at the results for both, I get a p-value of 0.0 in each case, and although the Z score for my result regions (1909.23333225) is higher than for the shuffled regions (1512.34240055), the difference is not huge:
diff.median.g20.bed my_regions_shuffled.bed
Input Files Parse Time: 407.926077843 Input Files Parse Time: 406.489609957
#### Samples
Sample # Stat Sample # Stat
1 0.00384960246773 1 0.016744983716
2 0.00703107250599 2 0.00764473232889
3 0.00495174606857 3 0.00763668446088
4 0.00830551216659 4 0.010381377025
5 0.00651359894589 5 0.00786252841803
6 0.00694955964229 6 0.00867184812393
7 0.00785941293248 7 0.0128478008673
8 0.00300208131406 8 0.00965871219167
9 0.0161754860882 9 0.0161625764828
10 0.00612241385575 10 0.0140784379638
11 0.00758022290081 11 0.00923275700507
12 0.00914677596129 12 0.00795006362919
13 0.0115290340203 13 0.0136765964466
14 0.0171499229021 14 0.0107164633663
15 0.0106131396121 15 0.00794190162492
16 0.00442453871231 16 0.0112743934613
17 0.0106013666295 17 0.00781952621329
18 0.0116470401228 18 0.00664186647169
19 0.0107669804397 19 0.00777346664384
20 0.0131679192161 20 0.0122719061291
Sample Dist Info: Sample Dist Info:
The expected mean under the NULL: 0.0254927935168 The expected mean under the NULL: 0.0103585013634
The sample mean: 0.00886937132522 The sample mean: 0.0103494311285
The normalized bootstrap SD: 0.000383492258427 The normalized bootstrap SD: 0.000301086586107
The real stat: 0.757668995964 The real stat: 0.465704511768
The scaled real stat: 0.732176202447 The scaled real stat: 0.455346010405
Z Score: 1909.23333225 Z Score: 1512.34240055
p-value: 0.0 p-value: 0.0
Execution Time: 20.6973369122 Execution Time: 28.0755569935
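For what it's worth, the reported Z scores can be reproduced from the other numbers in the output as (real stat − expected mean under the NULL) / normalized bootstrap SD:

```shell
# Z = (real stat - expected mean under the NULL) / normalized bootstrap SD
awk 'BEGIN{printf "%.2f\n", (0.757668995964 - 0.0254927935168)/0.000383492258427}'
# 1909.23
awk 'BEGIN{printf "%.2f\n", (0.465704511768 - 0.0103585013634)/0.000301086586107}'
# 1512.34
```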
I was expecting the result of the test with shuffled regions not to be significant, but it is. Any ideas what might be happening here? Could this be a bug in the script?
Did you check the README? http://www.encodestatistics.org/websvn/filedetails.php?repname=genome_structural_correction&path=%2Fpython_encode_statistics%2Ftrunk%2FREADME.txt
I did, what am I missing from it?
Are you sure about your input files? Besides, testing for correlations requires the --test cc option (bc is the default). I didn't understand your r calculation. Is your features' mean distance around 0.01 * 2500 bp? It seems that you're testing pretty small features in pretty small blocks.
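To check that, you can estimate the mean feature length and the mean gap between consecutive same-chromosome features directly from a coordinate-sorted BED; a toy sketch (file name and data hypothetical):

```shell
# Mean feature length and mean same-chromosome gap from a sorted BED (toy data)
printf 'chr1\t100\t600\nchr1\t1600\t2100\nchr1\t4100\t4600\n' > /tmp/toy_sorted.bed
awk -F'\t' '{len += $3 - $2; n++}
     prev_chr == $1 {gap += $2 - prev_end; g++}
     {prev_chr = $1; prev_end = $3}
     END {printf "mean_length=%d mean_gap=%d\n", len/n, gap/g}' /tmp/toy_sorted.bed
# mean_length=500 mean_gap=1500
```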
Cool, will try your suggestions. I think I misunderstood what -r should be. How do I guess what -r should be? How do I know the mixing distance of features -1 and -2?

A bit more info: I ran a parameter sweep on --region_fraction with a pair of shuffled beds, and I can't really explain the relationship between --region_fraction and the resulting Z scores and p-values... More troubling, some --region_fraction values just fail. Here are a few:

You're using too few samples (-n 20 is just for testing). The relationship is described in the GSC draft starting at page 26. What kind of feature are you testing?
Ok, the same loop with -n 1000 does give meaningful values. It sometimes crashes though with this error:

Are you using numpy? It looks like a numerical exception (a value too big to be held by your int/long type). Is your machine 64-bit? Can you edit your comment to put the error inside a code block? It's hard to read it like this. Maybe a pastebin link would be a better option.
I am using a 64-bit machine. How can I know if the script is using numpy? Is numpy a dependency of the script?
It's not a dependency, but the script will use it if available. It seems that statistical_functions.py (required by base_types.py) is running without numpy. Try installing numpy (or check whether it's installed). Numpy has better numerical type support and should get rid of this error.
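A quick way to check whether numpy is importable by the interpreter you launch the script with (substitute your actual python binary):

```shell
# Does the interpreter that runs block_bootstrap.py see numpy?
python -c 'import numpy; print(numpy.__version__)' 2>/dev/null \
  || echo "numpy not importable by this interpreter"
```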
Numpy is available for the version of python I am using:
This script:
Gives me:
Nah! I looked at the __new__ function. It expects sequence coords. Again, are you sure about your data files? No negative coords? No numbers too big to be represented by a long? 'Cause now it really seems to be a bug . . .
I hope my bed files are sane. For the sake of reproducibility, I've placed them here:
https://www.dropbox.com/sh/9hej91hs9uquvbt/Je83Vc7TL0/upload
Your bed files look OK. I wasn't able to reproduce your error (there is no my_regions_shuffled2.bed in your link), so I made one of my own. Comparing your diff file against its shuffled version using hg19.rmsk gives meaningful results for a broad range of r values (all slightly negative Z scores). So everything seems to work as expected. I suppose you're on a Mac with an old Python version.
So should I use -r 0.01, or will just any value do? I am using CentOS with Python 2.7.3. The my_regions_shuffled2.bed should just be another shuffling, like my_regions_shuffled.bed. So what Z-score/p-value do you get with the shuffled bed?

This is similar to a question I asked a while back (and never really found an answer for...): Need advice on the correct usage for the Genome Structure Correction statistical software from the ENCODE project