Hi,
I'd like to know if someone has a script for generating a BED file with random genomic regions over the whole human genome. The regions should follow a length distribution given as input (e.g., mean length = 3000 and some standard deviation). More generally, it should also be possible to supply the length distribution as an empirical vector containing all the desired lengths.
Thanks!
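For reference, here is a minimal Python sketch of the kind of script being asked for. It is only a starting point under stated assumptions: the function names (`random_regions`, `lengths_from_normal`) and the toy chromosome sizes are made up for illustration, and real chromosome sizes would have to come from a chrom.sizes-style file for your assembly. Each region is placed uniformly among all valid start positions, and the lengths can be either an empirical vector or draws from a normal distribution. If you already use bedtools, `bedtools shuffle` (which keeps the lengths of an input BED file) may cover the same need.

```python
import random


def random_regions(chrom_sizes, lengths, seed=None):
    """Yield one (chrom, start, end) BED record per entry in `lengths`.

    chrom_sizes: dict mapping chromosome name -> length in bp.
    lengths: iterable of desired region lengths (the empirical vector).
    """
    rng = random.Random(seed)
    chroms = list(chrom_sizes)
    for length in lengths:
        # Only chromosomes long enough to hold the region are eligible.
        eligible = [c for c in chroms if chrom_sizes[c] >= length]
        if not eligible:
            raise ValueError(f"no chromosome can hold a region of length {length}")
        # Weight each chromosome by its number of valid start positions,
        # so every possible placement in the genome is equally likely.
        weights = [chrom_sizes[c] - length + 1 for c in eligible]
        chrom = rng.choices(eligible, weights=weights, k=1)[0]
        start = rng.randrange(chrom_sizes[chrom] - length + 1)  # 0-based start
        yield chrom, start, start + length  # BED end coordinate is exclusive


def lengths_from_normal(n, mean, sd, seed=None, min_len=1):
    """Draw n lengths from a normal distribution, rounded and floored at min_len."""
    rng = random.Random(seed)
    return [max(min_len, round(rng.gauss(mean, sd))) for _ in range(n)]


if __name__ == "__main__":
    # Placeholder sizes for illustration only, NOT real human chromosome sizes.
    toy_sizes = {"chrA": 1_000_000, "chrB": 500_000}
    lengths = lengths_from_normal(5, mean=3000, sd=500, seed=1)
    for chrom, start, end in random_regions(toy_sizes, lengths, seed=1):
        print(f"{chrom}\t{start}\t{end}")
```

To use an empirical length vector instead, pass it directly as `lengths`, for example the end minus start values from an existing BED file.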
Not sure what you want these regions for, but have you considered that some areas of the genome cannot be sequenced or have strong sequence biases, such as repeat regions, extreme GC content, etc.? Consider whether the random regions are biologically relevant to you. You may want to filter them using the RepeatMasker or mappability tracks from UCSC.
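As a rough illustration of that kind of filtering, the sketch below drops any candidate region that overlaps an excluded interval. It assumes the excluded intervals (for example, rows exported from a repeat or low-mappability track) have already been loaded as (chrom, start, end) tuples in 0-based, half-open BED coordinates; the function names `build_index` and `overlaps_excluded` are hypothetical. With bedtools, the `-excl` option of `bedtools shuffle` serves a similar purpose.

```python
from bisect import bisect_left


def build_index(excluded):
    """Merge excluded (chrom, start, end) intervals into sorted, disjoint lists per chromosome."""
    by_chrom = {}
    for chrom, start, end in excluded:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or touching: extend the previous interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_excluded(index, chrom, start, end):
    """Return True if the half-open region [start, end) overlaps any excluded interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged interval that starts before the region ends.
    i = bisect_left(starts, end) - 1
    return i >= 0 and ends[i] > start


if __name__ == "__main__":
    # Toy excluded intervals for illustration only.
    index = build_index([("chrA", 100, 200), ("chrA", 150, 300)])
    candidates = [("chrA", 50, 120), ("chrA", 300, 400)]
    kept = [r for r in candidates if not overlaps_excluded(index, *r)]
    print(kept)
```

Rejected regions can simply be redrawn until enough accepted regions have been collected.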
Makes sense, thanks for the suggestion.
How do you control for GC content as well? Do you calculate the GC content of all the random regions of a given length and check whether each one matches both the target length and the target GC content?
I have the same question. How do you control GC content?
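One possible way to do this, roughly as described in the question above, is rejection sampling: draw a random region of the desired length, compute its GC fraction, and keep it only if it falls within a tolerance of the target. The sketch below is not an established method from this thread, just an illustration. It assumes the sequences are available in memory as a dict of strings (a toy genome here); the names `gc_fraction` and `sample_gc_matched` and the default tolerance are hypothetical. For a real genome you would read the sequence from a FASTA file, or compute GC per interval with a tool such as `bedtools nuc`.

```python
import random


def gc_fraction(seq):
    """Return the GC fraction over A/C/G/T bases, or None if there are none (e.g. all N)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def sample_gc_matched(genome, length, target_gc, tol=0.02, max_tries=10_000, rng=None):
    """Draw a random region of `length` whose GC fraction is within `tol` of `target_gc`.

    genome: dict mapping chromosome name -> sequence string.
    Returns (chrom, start, end), or None if no match is found in `max_tries` draws.
    """
    rng = rng or random.Random()
    eligible = [c for c in genome if len(genome[c]) >= length]
    if not eligible:
        raise ValueError(f"no chromosome can hold a region of length {length}")
    # Weight by the number of valid start positions on each chromosome.
    weights = [len(genome[c]) - length + 1 for c in eligible]
    for _ in range(max_tries):
        chrom = rng.choices(eligible, weights=weights, k=1)[0]
        start = rng.randrange(len(genome[chrom]) - length + 1)
        gc = gc_fraction(genome[chrom][start:start + length])
        if gc is not None and abs(gc - target_gc) <= tol:
            return chrom, start, start + length
    return None  # target too rare for this length and tolerance


if __name__ == "__main__":
    # Toy genome for illustration only: an AT-rich half followed by a GC-rich half.
    toy_genome = {"chrA": "AT" * 500 + "GC" * 500}
    print(sample_gc_matched(toy_genome, length=100, target_gc=1.0, tol=0.0,
                            rng=random.Random(0)))
```

To match a whole set of reference regions, call this once per reference region with that region's length and GC fraction. A tight tolerance on a rare GC value can need many draws, so binning GC values (say, in 5% steps) is a common compromise.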