Hello! I am analyzing a dataset that was provided to me: more than 150 microarray samples from a single tumor type. The data is already normalized and background corrected, and in theory I should be able to find differences in gene expression between patients who survive longer under chemotherapy and those who survive for a shorter time under the same treatment.
The problem is that I can't find any grouping based on median survival (high vs low), or even on survival split into tertiles (again high vs low), when in theory there should be differences. Differential expression analysis with limma finds no DEGs after FDR correction. After filtering for the most variable genes, I tried adjusting for hidden batch effects with SVA and for relevant clinical variables (age, performance status, volume of disease), with no improvement.
I also fitted an adjusted Cox proportional hazards model for each gene, but none were associated with survival after adjusting for multiple comparisons.
I have also tried k-means clustering with two groups. The clusters correlate with survival in the adjusted Cox model, and comparing them gives me differentially expressed genes, but I am not sure this approach is correct.
I do not understand how, with such a large sample size, previous research backing my assumptions, and supposedly good-quality data, I get no results in the differential expression analysis.
I have performed GSEA with the ranked genes to look for significantly enriched pathways, but nothing is significant after adjusting for multiple comparisons.
Any idea where I might be going wrong? Thank you all!
Let the data speak. If there's no difference, then there's no difference.
Human clinical data is always messy and unpredictable, especially when it comes to cancer (which is probably the genomically messiest group of diseases out there). Something that is true of one patient cohort might not be true of another. Every tumor is different, every patient is different (they are human beings -- e.g. I can't tell you why I might live longer or shorter than you), every human responds to drugs differently, dissection procedures introduce a ton of variability, there are always errors in diagnoses and sample metadata, etc. -- there's a very high noise-to-signal ratio, especially if you don't have matched normal tissue controls.
Remember: Your analysis is an "experiment", not a "positive control".
Negative results are still useful though!
If you're sure of the quality of the data (i.e. the upstream processing) and of your analysis, then the result is what it is. What may remain is to question your expectation that there must be a difference with this approach. What is the quality of the previous research you mention? Was it based on a handful of cases or thousands? Was the data different or handled differently? Another related question is what counts as a differentially expressed gene, e.g. what is the threshold? The clustering approach is valid but isn't the same as more typical statistical methods like limma. With clustering, you detect global patterns, from which you can find the features/genes that contribute most to the data split. However, when n<<p, as is typically the case with microarrays, you need to be careful with the high dimensionality. In essence, most distance/similarity measures become meaningless, leading to spurious clustering, although this is mitigated if there's strong structure in the data (e.g. very well separated clusters).
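To make the high-dimensionality point concrete, here is a minimal sketch on purely synthetic, structureless data (uniform random points, not microarray values). The function name `relative_contrast` and all parameters are made up for illustration. It measures how much the farthest and nearest neighbours of one point differ, relative to the nearest distance; as the number of dimensions grows, that contrast shrinks and "near" vs "far" stops meaning much.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over the distances from one reference
    point to all other points, for points drawn uniformly in the unit cube.

    Hypothetical helper for illustration only: the data has no real
    structure, so any apparent clusters would be spurious.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    ref, others = points[0], points[1:]
    dists = [math.dist(ref, p) for p in others]  # Euclidean distances
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # 150 "samples", with an increasing number of "genes" (dimensions).
    for n_dims in (2, 20, 200, 2000):
        print(n_dims, round(relative_contrast(150, n_dims), 3))
```

The contrast is large in 2 dimensions and small in 2000, which is why clustering on many noisy genes should be treated with caution unless the groups are very well separated.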
There are good answers here already, which I would follow, but I would also expect to see some differences (at least for confounders like age, gender etc.). What do you see in heatmaps? Visualizing the data can help a lot.
Also, are there datasets from other tumour types that show this exact median survival effect and correlate it with gene expression differences?
I have mostly had experience with RNA-seq, but at times some DE approaches failed or found no significant DE genes (such as limma-voom, though it was preferable for most samples), while others sometimes worked better (edgeR). This tended to be dataset dependent, so trying out another package might be a good solution.
Great suggestion on visualizing the data as well as doing DE between variables that you know should result in different gene expression profiles. I like it.
I would caution against trying out different DE packages until you find one that gives you genes with adjusted p<0.05. That can be a form of p-value hacking. In any case, limma is the state-of-the-art for microarray DE and has been well-benchmarked on hundreds of datasets. limma has always worked great for me for both microarrays and proteomics.
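As a rough illustration of why that is risky, here is a small sketch of what happens to the false-positive rate when you keep the best of several tests on a gene with no real effect. It assumes the tests are independent, which is a simplification: different DE packages run on the same data give correlated results, so the real inflation would be smaller than this, but it does not go away. The function names are hypothetical.

```python
import random


def chance_of_any_hit(n_methods, alpha=0.05):
    """Probability that at least one of n independent tests of a true
    null hypothesis gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_methods


def simulate_any_hit(n_methods, alpha=0.05, n_trials=20000, seed=0):
    """Monte Carlo check: under the null, p-values are uniform on [0, 1],
    so draw n_methods of them and keep the smallest, as a method-shopper would."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        best_p = min(rng.random() for _ in range(n_methods))
        if best_p < alpha:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for k in (1, 3, 5):
        print(k, round(chance_of_any_hit(k), 3), round(simulate_any_hit(k), 3))
```

Under the independence assumption, three methods already push the nominal 5% error rate to about 14%.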
If the FDR correction is killing your results, it's OK to filter out genes that are not informative before testing: https://www.bioconductor.org/help/course-materials/2003/Milan/Lectures/Filtering.pdf
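A minimal sketch of the idea, in plain Python rather than limma: a variance filter that never looks at the survival labels, and a Benjamini-Hochberg adjustment showing how the same raw p-value is penalised less when fewer genes are tested. The helper names (`keep_most_variable`, `bh_adjust`) and the toy numbers are made up for illustration; this is not the code from the linked slides.

```python
import statistics


def keep_most_variable(expr, fraction=0.5):
    """expr maps gene name -> list of expression values across samples.
    Keep the top `fraction` of genes by variance. The filter uses no
    outcome labels, only the expression values themselves."""
    n_keep = max(1, int(len(expr) * fraction))
    ranked = sorted(expr, key=lambda g: statistics.pvariance(expr[g]), reverse=True)
    return ranked[:n_keep]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    toy = {"a": [1, 1, 1, 1], "b": [0, 10, 0, 10], "c": [4, 5, 4, 5], "d": [2, 2, 2, 3]}
    print(keep_most_variable(toy, fraction=0.5))
    # Same raw p-value of 0.001, tested alongside 99 vs 9 uninteresting genes.
    print(bh_adjust([0.001] + [0.5] * 99)[0])
    print(bh_adjust([0.001] + [0.5] * 9)[0])
```

The filter has to be chosen before looking at the test results; picking the cutoff that gives the most hits would be another form of the p-value hacking mentioned above.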