Hi,
I have obtained binding locations of a specific protein on DNA by CHIP-seq. My hypothesis is that within those regions in specific cancer type, I expect to see higher mutation rate than random regions. I think, bootstrapping could solve the solution by comparing random vs my bed file.
For this reason, I searched through biostars and I found that bedtools shuffle is a viable option.
I generated these randomly shuffled bed files with this specific command.
shuffle -chrom -i highconf_250wider.bed -g /Users/morova/bedtools/GRCh37.genome.txt
Is there a way to validate that shuffling executed as I expected. For example there is also ChromFirst option. Is there a statistical method that you guys should suggest me ?
Thank you very much,
Best regards,
Tunc.
Testing for randomness is one of the hardest things to do in statistics - because one of the assumptions of true randomness is that anything can happen! Even very unlikely outcomes are essentially guaranteed to happen some proportion of the time. Practically this means that even if I provide you with the tools to visualize the before/after shuffled distribution, we won't be able to say 'nope, that distribution is too weird, can't be random' - because that by definition is something that can happen with a random shuffle.
There are methods to say, with some confidence interval, how random we think the shuffling is - but all those methods will require multiple independant shuffles. How many shuffles? Well, the less random the shuffle truly is, the fewer independant shuffles you'll need to be confident it's not random. But on the assumption that bedtools IS a good, random shuffle.... well.... a lot of shuffles. The tools commonly used are the Ent suite (http://www.fourmilab.ch/random/) and the Diehard suite (http://stat.fsu.edu/pub/diehard/).
A much better way of actually doing this is to look at the code, and make sure that all shuffling steps are known to be sufficiently random. I can't remember what language bedtools is in, I think python, so hopefully they used some generic shuffle function known to work well on a generic data structure. Note though that occasionally it's the case that, due to the insufficient granularity of the numbers from the random number generator, not every possible shuffle outcome is actually possible, so a code review is really the best way to detect these sorts of issues.
First of all thank you for the detailed answer. I appreciate that. If I wasn't shuffling chromosome locations, I would have draw a distribution graph of the numbers and check if it uniform(or looks similar to what it was aimed). On top of that, since I use —chrom option, it keeps the randomization event on the same chromosome. These two specific things makes me wonder if that randomisation is right or wrong.
No problems, I find myself sitting in a dull, grey waiting-room all day today, so plenty of time to help out :)
So I just looked at what bedtools shuffle actually does, and it doesn't do anything like what I expected. I thought it would shuffle the cards, but not change the cards themselves.... but it doesn't shuffle the order of the intervals in the file, it literally changes the interval start positions while maintaining their size. Why this is called a shuffle rather than a permutation I have no idea -_-; This is relevant because for a shuffle, the entire end-of-shuffle result is 1 data point for the randomness of the shuffle. For a permutation like this, each new position of an interval is 1 datapoint for the randomness of the shuffle, so you won't need to do multiple "shuffles".
So on that note, my previous advice doesn't really apply. You can test if this is random in a single "shuffle" by plotting the distribution as you say, but it's raised SO many questions in my head unrelated to the randomness of the shuffle. Can shuffled data include overlapping intervals? If not, we've got a space-filling-search problem, which is REALLY hard to make random. Really hard. If overlapping regions are OK then at least we can have some proper randomness in the new positions, but if your input data doesn't allow for overlapping intervals, this isn't fair.
I don't think using --chrom will cause much of a problem. Some chromosomes will have different interval sizes/distributions than others, and --chrom will maintain that I guess.
But this is just one of those things where, frankly, I wouldn't go down this path at all. Whatever your regions are for, it's guaranteed they could not possibly appear randomly throughout the genome. You've got centromeres/telomeres/gene deserts, long repeats, etc etc. To ignore all that will screw everything up. But I don't have any ideas on how to do it better...
Perhaps, if you're looking for motif enrichment locations or something, you could use another organism's genome that doesn't have the motif/transcription factor as your random shuffle? I don't know. I just know that taking a series of non-overlapping, non-random intervals, and then randomly moving them around, perhaps with overlaps, to a genome that has a variable amount of possibility for your features to exist at all, is not bootstrapping :-/