When classifying tumor samples into subtypes using gene sets and gene expression data, how meaningful is the approach when not all of the genes in a gene set are found in the samples? I believe this may significantly affect the results, yet it is not reported when results are presented, for example with GSVA. This has happened to me quite often, particularly with data from TCGA or ICGC. So far, I have only done this with the GSVA package, using method 'gsva'. Has anyone tested this issue?
Thanks
Note: cross-posting to Bioconductor
Hi Kevin
I agree with what you say.
I have encountered this problem especially when using custom gene sets created downstream of one pipeline together with an expression dataset obtained from a different pipeline. As you say, whether or not different biotypes are included makes a difference.
I'd like to hear other opinions, and I would also like to test the loss of statistical power.
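Before running any scoring, it may help to quantify how much of each gene set is actually present in the expression data. The snippet below is only a minimal sketch in plain Python, not part of GSVA; the function name `gene_set_coverage`, the gene set names, and the gene lists are all made up for illustration.

```python
def gene_set_coverage(gene_sets, measured_genes):
    """For each gene set, report how many of its genes appear in the expression data."""
    measured = set(measured_genes)
    report = {}
    for name, genes in gene_sets.items():
        genes = set(genes)
        found = genes & measured
        report[name] = {
            "n_total": len(genes),
            "n_found": len(found),
            # Fraction of the gene set that can contribute to a score
            "fraction": len(found) / len(genes) if genes else 0.0,
            "missing": sorted(genes - measured),
        }
    return report


# Hypothetical gene sets and expression matrix row names (illustration only)
gene_sets = {
    "subtype_A": ["TP53", "EGFR", "MALAT1", "XIST"],
    "subtype_B": ["MYC", "KRAS"],
}
measured_genes = ["TP53", "EGFR", "MYC", "KRAS", "GAPDH"]

coverage = gene_set_coverage(gene_sets, measured_genes)
for name, row in coverage.items():
    print(name, f"{row['n_found']}/{row['n_total']}", "missing:", row["missing"])
```

Reporting these fractions alongside the enrichment scores would at least make the mismatch visible, for instance when non-coding biotypes were filtered out by one of the two pipelines.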
In that case, you could start by exploring the statistical power of the tests commonly used in gene set enrichment, i.e., the chi-square (χ²) test, Fisher's exact test, and the hypergeometric test. However, this branches completely into statistics, so I encourage you to pursue the issue on https://stats.stackexchange.com/
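As a rough starting point for that exploration, here is a small sketch of how an over-representation p-value from the hypergeometric test can change when only part of a gene set is measured. Note that this applies to over-representation tests, not to the GSVA method itself. All numbers are invented, the helper `hypergeom_upper_tail` is hypothetical, and the sketch assumes the universe size stays the same and the observed overlap shrinks in proportion to the gene set.

```python
from math import comb


def hypergeom_upper_tail(k, universe, set_size, n_selected):
    """P(X >= k) for the overlap X between a gene set and a list of selected genes."""
    total = comb(universe, n_selected)
    upper = min(set_size, n_selected)
    return sum(
        comb(set_size, i) * comb(universe - set_size, n_selected - i) / total
        for i in range(k, upper + 1)
    )


# Invented numbers for illustration only
universe = 20000    # genes measured in the experiment
n_selected = 500    # e.g. differentially expressed genes

# Full gene set: 50 genes, 6 of them among the selected genes
p_full = hypergeom_upper_tail(6, universe, 50, n_selected)

# Same gene set with only half of its genes measured: 25 genes, overlap of 3
p_partial = hypergeom_upper_tail(3, universe, 25, n_selected)

print(f"full set:    p = {p_full:.4g}")
print(f"partial set: p = {p_partial:.4g}")
```

With the same enrichment ratio, the smaller effective set gives a larger p-value in this toy case, which is the kind of power loss that a proper simulation would need to characterize.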