Need Advice On The Correct Usage For The Genome Structure Correction Statistical Software From The Encode Project
12.6 years ago by Chris Whelan

I'm wondering if anyone has successfully used the Genome Structure Correction statistical software from the ENCODE project (described in the supplement to the Birney et al. 2007 ENCODE Nature paper, and in Bickel et al. Ann Appl Stat 2011), and could offer some advice.

I've got a list of genomic regions, and I'd like to determine whether they are enriched for various genomic features (e.g. segmental duplications, repeats, genes) in a statistically significant way. Typically I do this by permuting the locations of my regions many times with shuffleBed from BEDTools to establish a null distribution of overlaps between similarly sized sets of regions and the features of interest, and then computing the quantile of the observed value. However, I've seen criticism of this approach on the grounds that randomly permuting genomic locations ignores local properties of the genome, and GSC is, as I understand it, meant to address that problem.
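For concreteness, my current shuffle-based test looks roughly like the sketch below; the file names, the use of intersectBed to count overlapping regions, and the choice of 1000 permutations are illustrative rather than exact.

intersectBed -a my_regions.bed -b features.bed -u | wc -l   # observed overlap count
for i in $(seq 1 1000); do
    shuffleBed -i my_regions.bed -g hg19.chrom.sizes \
        | intersectBed -a stdin -b features.bed -u | wc -l
done > null_overlap_counts.txt
# empirical p-value = fraction of null counts >= the observed count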

So, using the block-bootstrap-0.8.1 code at http://www.encodestatistics.org/ , I tried to compute a p-value for overlap between my regions and a feature file.

block_bootstrap.py -1 features.bed -2 my_regions.bed -d complement_gaps.bed -r 0.2 -n 500 -t bc -B

complement_gaps.bed was created using complementBed on the "gaps" track from UCSC.
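For reference, the complement can be produced with something along these lines (gap.bed is just what I'm calling the downloaded track here, and I sort it first to be safe):

sortBed -i gap.bed > gap.sorted.bed
complementBed -i gap.sorted.bed -g hg19.chrom.sizes > complement_gaps.bed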

I got a low p-value (0.0!), which was encouraging, but I wanted to test it, so I shuffled my regions with shuffleBed:

shuffleBed -i my_regions.bed -g hg19.chrom.sizes > my_regions_shuffled.bed
block_bootstrap.py -1 features.bed -2 my_regions_shuffled.bed -d complement_gaps.bed -r 0.2 -n 500 -t bc -B

Unfortunately this also gives a very low p-value, even after repeating the shuffle and test many times. I'd expect the shuffled files generally not to show significant overlap with my features file.
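To be explicit, each repetition was a fresh shuffle followed by a fresh block_bootstrap.py run, roughly like this (the ten iterations shown are arbitrary):

for i in $(seq 1 10); do
    shuffleBed -i my_regions.bed -g hg19.chrom.sizes > my_regions_shuffled_$i.bed
    block_bootstrap.py -1 features.bed -2 my_regions_shuffled_$i.bed -d complement_gaps.bed -r 0.2 -n 500 -t bc -B
done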

1) Am I using block-bootstrap correctly?

2) My regions are relatively few (~100) and large (~10 kb to 200 kb). I didn't see anything in the GSC papers or documentation on what kinds of data the test is or isn't suited to, but ENCODE seems to use it mostly on ChIP-seq data sets with more numerous, smaller regions of interest. Is this test inappropriate for data like mine?

3) I guessed at how to set the -r parameter; does it make a big difference? Does anyone have guidelines on how to set the parameters for this test? I didn't get much from the papers or documentation.

statistics enrichment encode

You have accidentally included a comma in your encodestatistics link. I think the best way to get these questions answered would be to email the authors of the paper. It would be appreciated if you posted the response here, though, so that other people seeking the same answers will see it.


Added some whitespace after the URL to get rid of the comma automatically added to the link.

I did write to the author of the software but didn't hear back. If anyone else makes any progress I'd love to hear about it.


Hi Chris, when you got a low p-value for your shuffled BED file, did you see a negative or a positive Z-score?


I'm getting very negative Z-scores for both tests, in the -11 to -12 range.


Dear Chris, I'm facing a similar problem. Do you know how to use the GSC software now? Thank you!


No, I never did figure GSC out. For now I'm doing location permutations with BEDTools shuffleBed and also trying the GenometriCorr R package for this task.
