Hello,
Thank you in advance to anyone who considers this question.
I'm doing some gene set enrichment analysis and have a list of 100 genes that I curated from the literature. The list is very specific for genes involved in a type of chromatin remodeling. I have performed a type of gene set enrichment using this list and found significant enrichment for rare coding variants in these genes within a disease cohort compared to a control cohort.
What I would like to do now is create several additional control gene sets to see whether the enrichment is specific to the chromatin remodeling gene set. These control sets should be approximately the same size (about 100 genes) and have the same overall set metrics as my chromatin remodeling set.
For instance, in my chromatin remodeling gene set the average gene length is approximately 72 kb. What I'd really like to do is create several additional control gene lists of about the same size (~100 genes) that also have an average gene length of approximately 72 kb. I would also like to extend this to other gene metrics, such as GC content, replication time, and others.
Does anyone know of an existing program/script that would do this if provided with a gene list that includes the gene length, GC content, replication time, etc. for every gene?
Thanks again for thinking about this - it would be a big help.
This is pretty close to what I'm looking for - thanks! However, as I understand this code, the lists being generated are simply genes that each meet static criteria, such as gene size > 72, GC < 70, etc.
What I'm trying to end up with are gene sets where the average gene length for all the genes in the random set is approximately 72 and so on for GC content, replication time, etc...
Is there a variation on this code in R that can do this? Thanks so much for your help!
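To make the goal concrete, here is a minimal sketch of one way to match on set-level averages rather than per-gene cutoffs: keep drawing random sets of 100 genes and accept a set only when its means fall within a tolerance of the targets. This is plain rejection sampling, not the code discussed in this thread. The table `background`, its columns `length_kb` and `gc_pct`, the GC target, and the 5% tolerance are all hypothetical stand-ins; only the 72 kb target comes from the question.

```r
set.seed(42)

# Hypothetical background table: one row per gene, toy values only.
# In practice, remove the chromatin remodeling genes from it first.
background <- data.frame(
  gene      = paste0("gene", 1:5000),
  length_kb = rlnorm(5000, meanlog = log(45), sdlog = 1),
  gc_pct    = rnorm(5000, mean = 45, sd = 8)
)

# Set-level targets: 72 kb is from the question, the GC value is made up
targets   <- c(length_kb = 72, gc_pct = 45)
tolerance <- 0.05  # accept sets whose means are within 5% of each target

draw_matched_set <- function(bg, targets, n = 100, tol = 0.05,
                             max_tries = 10000) {
  for (i in seq_len(max_tries)) {
    idx   <- sample(nrow(bg), n)  # random set, no gene repeated
    means <- colMeans(bg[idx, names(targets), drop = FALSE])
    # Keep the set only if every metric mean is close enough to its target
    if (all(abs(means - targets) <= tol * abs(targets))) {
      return(bg[idx, ])
    }
  }
  stop("no matching set found; loosen tol or raise max_tries")
}

# Five control sets, each with ~100 genes and matched averages
control_sets <- lapply(1:5, function(i) {
  draw_matched_set(background, targets, tol = tolerance)
})

# One column per control set, one row per metric
sapply(control_sets, function(s) colMeans(s[, names(targets)]))
```

Adding more metrics (replication time, etc.) just means adding entries to `targets`, although the acceptance rate drops as constraints accumulate, so the tolerance may need loosening.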
You can do that too by creating the boundaries with the rnorm() function. That way you give a range but you don't specify a hard threshold. For size, it could be something like this (peeking at the code above):
Now you have a random gene set of 100 genes whose average-ish size is 72. You can play with the mean and sd values to get the boundaries that you need/want.
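As a rough illustration of the rnorm() idea, here is one possible reading of it, not the original code from this thread: draw 100 soft target sizes from a normal distribution centred on 72, then pick the closest not-yet-used gene for each target. The table `background`, the column `length_kb`, and the choice of sd = 20 are assumptions made for the example.

```r
set.seed(7)

# Hypothetical background table with toy gene lengths in kb
background <- data.frame(
  gene      = paste0("gene", 1:5000),
  length_kb = rlnorm(5000, meanlog = log(45), sdlog = 1)
)

# One soft target per gene to pick; mean and sd are the knobs to tune
wanted <- rnorm(100, mean = 72, sd = 20)

picked    <- integer(0)
available <- rep(TRUE, nrow(background))
for (w in wanted) {
  d <- abs(background$length_kb - w)
  d[!available] <- Inf      # never pick the same gene twice
  best <- which.min(d)      # gene whose length is closest to this target
  picked <- c(picked, best)
  available[best] <- FALSE
}

control_set <- background[picked, ]
mean(control_set$length_kb)  # close to 72, not exactly 72
```

A larger sd gives a wider spread of gene sizes around the same average; rerunning with a different seed gives another control set. The same loop could be repeated per metric, but matching several metrics at once would need a combined distance rather than one column.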