Hi!
The question itself is fairly simple, but I have not come up with a solution I feel comfortable with.
What I have: Two differential expression results from DESeq2 likelihood ratio tests (LRT) of a time course (different samples of the same observation, but not treated as biological replicates). This means two datasets, each of which can be sorted on any metric, e.g. the adjusted p-value, or I guess the test statistic as well. I am not fully sure what the statistic tells me in an LRT, but it does seem to correlate with the p-values and the variance of these genes, i.e. genes with high variance and low p-values have high stat values. I also have clustering results for the overlap of DE genes.
What I need: There is a large overlap of DE genes, which I am also able to group into highly similar clusters between the datasets. I want to prioritize genes, say the "top 10 most significant" genes in each cluster. But even regardless of the clustering, how should I go about prioritizing genes across these two datasets?
What is the issue: When using "averaging" approaches, values such as the adjusted p-values are systematically lower in one dataset (I'll call this dataset1), because its replicates were much more consistent than those of the other one (dataset2). I know many share the opinion that the adjusted p-value is more useful as a hard cut-off than as a ranking metric, but with an LRT I feel like I can only use either the p-values (adjusted or non-adjusted) or the stat column for prioritization. In any of these cases the same issue remains: the values are systematically different in dataset1 compared to dataset2, so any type of "averaging" approach ends up driven by the more extreme values of one dataset. It will make more sense when I explain what I have tried below.
What I have tried:
- Taking the average of the p-values and sorting genes based on this (Edgington's method?). The problem is that a gene's position can be driven almost entirely by being at the top of only one of the datasets (dataset1, where the values are several orders of magnitude smaller), instead of by being "robustly" significant in both, because the same genes sit more toward the "end" of the significant values in the other dataset (dataset2).
- Combining the p-values through other methods like Fisher's. The issue here is that the more consistent dataset1 has genes with p-values smaller than R is able to represent, so they are all reported as exactly 0 and I can't use calculations that take the log of the p-values.
- A very manual Venn diagram method, where I take the top N genes by adjusted p-value in each dataset and keep increasing N until the overlap between the two lists reaches 10 genes. These would then be the "top 10" genes of that cluster. The explanation feels a bit dodgy, and it requires a lot of manual filtering as opposed to defining one clear rule.
- My favourite method, but I just came up with this and have no clue about its validity. Sorting each dataset by the adjusted p-value (it could just as easily be the unadjusted p-value, or even the stat column, I don't have strong opinions on this) and then creating a new column that ranks the genes from 1 to the end of the list, i.e. the most significant gene gets rank 1 in each dataset, regardless of the actual p-value. Then I take the average of the two ranks. Say the gene with rank 1 in dataset1 is the 17th gene in dataset2: its average rank is (1 + 17)/2 = 9, so the new combined rank score of this gene is 9.
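To make the rank-averaging idea concrete, here is a minimal sketch in Python (the same logic is a few lines in R). Everything in it is hypothetical: the gene names, the p-values and the function names `rank_genes` and `mean_rank` are made up for illustration. It assumes each dataset is a mapping from gene to p-value, ranks only the genes present in both, and gives tied genes their average rank, which matters here because many dataset1 p-values are reported as exactly 0.

```python
def rank_genes(pvalues):
    """Map gene -> rank (1 = smallest p-value); tied genes share their average rank."""
    ordered = sorted(pvalues, key=pvalues.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to the end of the run of genes with the same p-value.
        j = i
        while j + 1 < len(ordered) and pvalues[ordered[j + 1]] == pvalues[ordered[i]]:
            j += 1
        shared_rank = ((i + 1) + (j + 1)) / 2
        for gene in ordered[i:j + 1]:
            ranks[gene] = shared_rank
        i = j + 1
    return ranks


def mean_rank(pvals_a, pvals_b):
    """Return (gene, mean rank) pairs for genes in both datasets, best first."""
    shared_genes = set(pvals_a) & set(pvals_b)
    # Rank within the shared genes only, so both rank scales have the same length.
    ranks_a = rank_genes({g: pvals_a[g] for g in shared_genes})
    ranks_b = rank_genes({g: pvals_b[g] for g in shared_genes})
    combined = {g: (ranks_a[g] + ranks_b[g]) / 2 for g in shared_genes}
    # Ties in the mean rank are broken alphabetically to keep the output stable.
    return sorted(combined.items(), key=lambda item: (item[1], item[0]))


# Made-up p-values: dataset1 is far more extreme and contains exact zeros.
dataset1 = {"geneA": 0.0, "geneB": 0.0, "geneC": 1e-200, "geneD": 1e-50}
dataset2 = {"geneA": 1e-3, "geneB": 1e-8, "geneC": 1e-6, "geneD": 1e-7, "geneE": 0.5}

for gene, score in mean_rank(dataset1, dataset2):
    print(gene, score)
```

Because only the ordering within each dataset enters the score, the difference in magnitude between dataset1 and dataset2 no longer matters. One possible refinement, which I have not validated, would be to rank on the stat column instead, since it does not collapse to 0 the way the p-values do.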
The main thing here is that I would like a systematic way of prioritizing these genes, mostly for visualization and accurate presentation; I don't plan on doing a ranked enrichment analysis or anything. The Venn diagram way of taking overlaps of the top N genes in both clones is in a way an intuitive method, and the one I would probably go for if I were just looking at these lists and writing gene names down with pen and paper. As the datasets have different sizes, perhaps a more accurate Venn diagram approach would be to take the top x% of significant genes in each dataset, increasing x until the overlap reaches my desired "top 10 shared significant genes". Anyway, I would be extremely grateful for any input and criticism on this. There are some extra-snazzy "sequential agreement" methods out there, but it just doesn't feel like it should be this complicated...
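The "top x%" version of the Venn diagram idea could be written as one explicit rule along these lines. This is again only a Python sketch with made-up gene names and p-values; `top_percent_overlap` is a hypothetical helper, and the 1% step size is an arbitrary assumption. It grows x one percent at a time and stops as soon as the two top lists share the wanted number of genes.

```python
def top_percent_overlap(pvals_a, pvals_b, n_wanted=10):
    """Grow the top x% of each dataset until the overlap holds n_wanted genes.

    Returns (x, shared genes). If the target is never reached, x is 100 and
    the shared genes are simply all genes present in both datasets.
    """
    # Sort by p-value, breaking ties by gene name so the result is reproducible.
    order_a = sorted(pvals_a, key=lambda g: (pvals_a[g], g))
    order_b = sorted(pvals_b, key=lambda g: (pvals_b[g], g))
    shared = set()
    for percent in range(1, 101):
        # Integer ceiling of percent% of each list length (avoids float rounding).
        n_a = -(-percent * len(order_a) // 100)
        n_b = -(-percent * len(order_b) // 100)
        shared = set(order_a[:n_a]) & set(order_b[:n_b])
        if len(shared) >= n_wanted:
            break
    return percent, sorted(shared)


# Made-up example with datasets of different sizes (5 and 10 genes).
dataset_a = {"g1": 1e-300, "g2": 1e-200, "g3": 1e-100, "g4": 1e-50, "g5": 1e-10}
dataset_b = {"g3": 1e-9, "g1": 1e-8, "g7": 1e-7, "g2": 1e-6, "g8": 1e-5,
             "g4": 1e-4, "g9": 1e-3, "g5": 1e-2, "g10": 0.1, "g6": 0.5}

percent, genes = top_percent_overlap(dataset_a, dataset_b, n_wanted=2)
print(percent, genes)
```

Note that the overlap can jump past the target in a single step, so the rule returns at least `n_wanted` genes rather than exactly that many, and the genes inside the final overlap are not ordered relative to each other.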