I have been using HOMER and bedtools (annotateBed with tables from UCSC) to annotate ChIP-seq peaks and make diagrams showing how many peaks fall in promoter/intron/exon/upstream/downstream/etc. However, I would like to create some 'random' reads in the genome to compare against, to see whether my ChIP-seq experiment shows enrichment relative to random reads in the genome.
I would like to know 3 things:
- Is there a program out there that would spit out random regions of the human genome?
- Have there been statistical tests developed specifically for this comparison?
- What are some of the other annotation programs that people use?
For the 1st point, I know it is not hard to write something that would dump out 100-mers randomly across the genome (since I am using +/-50bp from the ChIP-seq peak), but the better way to do this would be to take mappability and GC bias into account, right? I am using 36bp Illumina reads, but I'm not really looking for an Illumina read simulator. I don't want to reinvent the wheel, since I'm sure somebody out there has already solved this, and I was hoping someone here could point me in the right direction.
For the 2nd point, I was thinking of using a t-test for each of the categories (upstream/downstream/exon/promoter/etc.) and correcting for multiple testing. So I would compare the number of 100bp windows overlapping promoters identified by ChIP-seq against the number from the random windows that I simulated. Am I oversimplifying things?
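For comparison, here is a sketch of one alternative to the t-test that I could imagine: an empirical p-value per category computed from many random sets, followed by a Bonferroni correction. This is not an established method for this problem as far as I know, just an illustration of the kind of comparison I mean; all counts below are invented and the function names are hypothetical.

```python
def empirical_p(observed, null_counts):
    """One-sided empirical p-value for enrichment.

    Fraction of random sets whose count is at least the observed count,
    with a +1 pseudocount so the p-value is never exactly zero.
    """
    at_least = sum(1 for c in null_counts if c >= observed)
    return (at_least + 1) / (len(null_counts) + 1)


def bonferroni(pvals):
    """Bonferroni-adjust a dict of p-values, capping each at 1.0."""
    m = len(pvals)
    return {name: min(1.0, p * m) for name, p in pvals.items()}


# Invented example: peaks per category in the ChIP-seq data, and the counts
# seen in each of several random sets of the same size.
observed = {"promoter": 120, "exon": 40, "intron": 300}
random_sets = {
    "promoter": [30, 42, 35, 38, 29, 41, 33, 36, 40],
    "exon": [38, 45, 41, 37, 44, 39, 42, 40, 43],
    "intron": [310, 295, 305, 290, 315, 300, 298, 307, 302],
}

raw = {cat: empirical_p(observed[cat], random_sets[cat]) for cat in observed}
for cat, p in bonferroni(raw).items():
    print(cat, round(p, 3))
```

With only a handful of random sets the smallest attainable p-value is limited by the number of sets, so in practice many more than nine would be needed.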
For my last point, I know other programs such as CEAS exist, and I was wondering what people use and think of these different programs for annotation. Also, what do people do when a peak is near two genes (say, in the promoter of one gene and downstream of another)? Do you double count the peak?
I'm curious how to take GC content into account in those random regions. Does each random region have to have the same length and GC content as the peak it is meant to match?
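To make the question concrete, this is roughly the kind of matching I have in mind: keep only candidate random sequences with the same length as the peak and a GC fraction within some tolerance. It is only a sketch; the `gc_matched` helper and the 0.05 tolerance are arbitrary choices of mine, not taken from any existing tool.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # sequence is all N or empty
    return (seq.count("G") + seq.count("C")) / acgt


def gc_matched(peak_seq, candidates, tol=0.05):
    """Return candidates with the peak's length and GC fraction within tol."""
    target = gc_fraction(peak_seq)
    return [
        cand
        for cand in candidates
        if len(cand) == len(peak_seq)
        and abs(gc_fraction(cand) - target) <= tol
    ]


# Toy example with 4-mers instead of 100-mers.
print(gc_matched("ATGC", ["AATT", "GCAT", "GGCC", "ATG"]))
```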
Have a look here: http://homer.salk.edu/homer/motif/index.html