I have a group of genes (belonging to a certain developmental pathway, e.g. known to increase trichome number) and their expression data from RNA-seq. I have the log2 fold changes of each of 3 mutants relative to the control (wild type), as computed by the edgeR package.
I have been tasked with creating a single score for the trichome development genes as a group that:
1) indicates whether the genes are differentially regulated between mutants (log2 fold change > 1.5),
2) correlates with the observed phenotype, and
3) takes into account the number of differentially regulated and statistically significant (FDR < 0.05) genes that contribute to it.
I have the following numbers of genes that fit the first two criteria in the 3 mutants: 20 genes, 130 genes and 145 genes. I have calculated a score from these as follows:
I scaled the gene expression data (values ranging from -6.0 to +4.0) to lie between 0 and 1 and then computed the geometric mean for each mutant. This gives me scores of 0.0004, 0.0021 and 0.02, and these three correlate very well with the observed phenotype (barely any trichomes, a few trichomes and a lot of trichomes for mutants 1, 2 and 3, respectively).
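The scoring step described above can be sketched roughly as follows. This is only an illustration under assumptions: the function names are made up, the scaling is assumed to use the fixed bounds -6.0 and +4.0 rather than each mutant's own minimum and maximum, and the example values are invented, not real data.

```python
import math


def min_max_scale(values, lo, hi):
    """Linearly map values from the interval [lo, hi] onto [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return [(v - lo) / (hi - lo) for v in values]


def geometric_mean(values):
    """Geometric mean computed in log space to avoid underflow.

    Returns 0.0 if any value is exactly 0, because the product is then 0.
    """
    if not values:
        raise ValueError("need at least one value")
    if any(v < 0 for v in values):
        raise ValueError("geometric mean is undefined for negative values")
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical log2 fold changes for one mutant (invented numbers).
example_log2fc = [-2.0, 1.5, 3.0, -4.0]

# Assumed fixed bounds taken from the stated data range (-6.0 to +4.0).
scaled = min_max_scale(example_log2fc, lo=-6.0, hi=4.0)
score = geometric_mean(scaled)
print(score)
```

With this kind of scaling, any gene sitting exactly at the lower bound maps to 0 and forces the geometric mean to 0, and values near the lower bound pull the score strongly towards 0.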
I have three problems, however:
a) Is there a better way to scale the numbers so that they don't lead to such small scores (0.0004/0.0021)?
b) I'm at a loss as to how to account for the vastly different number of genes contributing to the above scores (i.e. 20, 130 and 145).
c) Is there a way to assess how good this score is in some statistical manner?