Hi guys, I have a question about co-expression networks. I have a marker gene and a list of genes co-expressed with it, based on mutual information (MI). This list (the neighbourhood) contains around 1000 genes, which is of course not a manageable number. Is there a way to choose the best, or a representative, set of neighbour genes, for example according to a threshold on the MI value? I tried ranking the genes from the most to the least strongly correlated, but at some point I have to stop and choose a final number of genes. Is there a way to choose a cut-off point that could be considered "robust"? I have no idea, because I know that it depends on the final goal, but in my case no experiments are feasible with this huge number of genes.
Could you help me please?
Thanks in advance
You imply that follow-up experiments are the limiting step, so why not rank the genes in a way that's relevant to the experiments or the question being addressed, and take the top n, with n being whatever is suitable for follow-up? You can also use the old elbow rule trick: plot the relevant values in decreasing order and see whether there's an elbow. In many real-life data sets, there is a sharp initial decrease followed by a flat part. The point at which the curve flattens, which is not always well defined, is usually a good practical cut-off, but it may still give you too many candidates to follow up.
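As a rough illustration of the elbow rule, here is a minimal Python sketch. It assumes the MI values are available as a plain list or array, and it uses one common heuristic that is not part of the suggestion above: the elbow is taken as the point on the sorted curve that lies furthest below the straight line joining the first and last points. The function name `elbow_index` is made up for this example.

```python
import numpy as np

def elbow_index(scores):
    """Return the position of the elbow in the scores sorted in decreasing order.

    Heuristic (an assumption of this sketch): rescale both axes to [0, 1] and
    pick the point with the largest gap below the chord joining the first and
    last points. Genes ranked before the returned position are the candidates.
    """
    y = np.sort(np.asarray(scores, dtype=float))[::-1]  # decreasing order
    n = y.size
    span = y[0] - y[-1] if n > 0 else 0.0
    if n < 3 or span == 0:
        return n  # no meaningful elbow: keep everything
    x = np.linspace(0.0, 1.0, n)      # rank, rescaled to [0, 1]
    y_norm = (y - y[-1]) / span       # score, rescaled to [0, 1]
    gap = 1.0 - x - y_norm            # how far each point sits below the chord
    return int(np.argmax(gap))

# Toy MI values: three strong neighbours, then a flat tail.
mi_values = [1.0, 0.9, 0.8, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04]
k = elbow_index(mi_values)
print(k)  # number of genes ranked above the elbow
```

With real data the curve is rarely this clean, so it is worth plotting the sorted values and checking the suggested position by eye rather than trusting the number blindly.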
Thank you very much for your answer. The problem is always the same... there is no clear question, and I was simply asked to make inferences.
If you have the input expression dataset that was used to compute MI, you may want to consider doing a randomization test to estimate the type I error rate (false positive rate) as a function of MI threshold. You'd recompute MI in randomly re-assorted input data sets, to determine what the false positive rate is at a given MI or correlation cutoff under the null hypothesis of no associations among expression profiles. You would then pick a cutoff that has a low enough false positive rate to satisfy your application. In large data sets, high MI values will occur by chance, and as data set size increases the false positive rate at any given fixed cutoff MI value can become larger. This approach says nothing about biological significance, but would control your false positive rate.
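The randomization idea can be sketched in Python roughly as follows. This is only an illustration under several assumptions: the expression data are a NumPy array with genes in rows and samples in columns, MI is estimated with a simple equal-width histogram (your own MI estimator should be substituted), and the null distribution is built by shuffling the marker gene's values across samples. The names `binned_mi` and `null_mi_threshold`, and the parameters `bins`, `n_perm` and `alpha`, are made up for this example.

```python
import numpy as np

def binned_mi(x, y, bins=8):
    """Histogram-based MI estimate in nats (a simple stand-in estimator)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def null_mi_threshold(marker, genes, alpha=0.01, n_perm=20, bins=8, seed=0):
    """MI cutoff whose false positive rate under the permutation null is about alpha."""
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_perm):
        shuffled = rng.permutation(marker)  # breaks any real association
        null.extend(binned_mi(shuffled, g, bins) for g in genes)
    return float(np.quantile(null, 1.0 - alpha))

# Synthetic example: 50 unrelated genes plus one gene that tracks the marker.
rng = np.random.default_rng(1)
marker = rng.normal(size=200)
noise_genes = rng.normal(size=(50, 200))
linked_gene = marker + 0.1 * rng.normal(size=200)
genes = np.vstack([noise_genes, linked_gene])

threshold = null_mi_threshold(marker, genes)
observed = np.array([binned_mi(marker, g) for g in genes])
selected = np.flatnonzero(observed > threshold)  # indices of genes passing the cutoff
```

Lowering `alpha` gives a stricter cutoff with fewer false positives but more missed neighbours, and increasing `n_perm` gives a smoother estimate of the null distribution at the cost of run time.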
Thank you Ahill. I finally performed the randomization, which seems to be the only satisfactory criterion for choosing a threshold. In the end, the threshold is a compromise between false positive and false negative findings.