I'm dealing with data that contain 47 tumor and 5 normal samples. The aim is to find genes upregulated in tumor. Before doing a differential analysis, I made a clustering heatmap to check how well the samples cluster.
For clustering:
As I have raw counts (featureCounts), I transformed them into a vsd matrix using DESeq2.
From vsd_matrix I took the top 10% most variable genes for visualization.
vars <- apply(vsd_matrix, 1, IQR)                 # per-gene spread (interquartile range)
set <- vsd_matrix[vars > quantile(vars, 0.9), ]   # keep the top 10% most variable genes
With this I calculated z-scores and plotted the data: Clustering heatmap [in the heatmap annotation, blue is normal and brown is tumor]
From the heatmap I see that some of the tumor samples do not cluster well with the others. The tumor samples form two clusters.
I removed two normals with very poor library sizes from further analysis.
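For anyone following along, here is a minimal sketch of the z-score and clustering step in base R. The matrix is simulated as a stand-in for the real `vsd_matrix`, and the choice of Euclidean distance, default `hclust` linkage and `k = 2` are assumptions for illustration, not necessarily what was used for the heatmap above.

```r
# Simulated stand-in for the VST matrix (genes in rows, samples in columns).
# In the real analysis vsd_matrix comes from DESeq2.
set.seed(1)
n_genes <- 200
n_samples <- 10
gene_sd <- seq(0.1, 2, length.out = n_genes)  # recycled so each gene gets its own spread
vsd_matrix <- matrix(
  rnorm(n_genes * n_samples, mean = 8, sd = gene_sd),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

# Top 10% most variable genes by interquartile range
vars <- apply(vsd_matrix, 1, IQR)
set <- vsd_matrix[vars > quantile(vars, 0.9), ]

# Row-wise z-score: scale() works on columns, so transpose, scale, transpose back
z <- t(scale(t(set)))

# Hierarchical clustering of the samples on the z-scored subset,
# cut into two groups (assumed k = 2)
hc <- hclust(dist(t(z)))
clusters <- cutree(hc, k = 2)
table(clusters)
```

The z-scores only change how each gene is displayed (centred at 0, unit variance per gene). They are for visualization and are not an input to the differential expression test.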
When I did differential analysis on all 47 tumor and 3 normal samples, I saw only 4 genes upregulated in tumor among the differentially expressed genes.
But when I did differential expression analysis (DEA) between the 3 normals and the 35 tumor samples that formed one cluster, I found approximately 30 upregulated genes.
In the same way, when I did DEA between the 3 normals and the 12 tumor samples that formed the other cluster, I found around 60 upregulated genes.
Why do different analyses give different results? Do I need to remove some tumor samples for DEA based on clustering?
Any help is appreciated
My big question for you is this: how are you conducting differential expression analysis? From a literal interpretation of your text, one would assume that you directly transformed your raw featureCounts counts via the variance-stabilising transformation (VST) of DESeq2, which is of course an incorrect procedure, and that you are then possibly conducting differential expression on the VST-derived counts, which again is not correct. Can you clarify?
Another issue is the large imbalance in sample numbers between tumour and normal, but this should not necessarily preclude a differential expression analysis.
As for the other things that you're doing, i.e., filtering your samples and then re-generating p-values: it is perfectly logical that you'll obtain different p-values by doing this. This can happen for any one or more of many reasons, including the alteration of the background data distribution, the inadvertent selection of a particular sub-type of cancer, etc. It is not a good procedure, by the way, without major justification for filtering the samples in this way.
Finally, depending on your cancer type, it is logical that tumours would not cluster together. Apart from the fact that each tumour cell is different, it is recognised that many cancers are sub-divided into main molecular types. For example, breast cancer is divided by IHC based on ER, PGR, and HER2 status, and has known molecular sub-types, too (luminal A/B, basal, triple-negative, etc.).
Added note: be cautious, in addition, of how you define 'normal' in the context of cancer. If you have 'normal' tissue that was merely extracted from the surrounding tumour, then it is most likely not normal at all and will have a cancer-like profile.
Hi Kevin,
Thanks for the reply. I used vsd_matrix only for the clustering part. For differential analysis I didn't use VST-derived counts. I understand that tumor samples would not necessarily cluster together. But when I use all the tumor samples together, I see very few upregulated genes. All the normals are normal tissues without any tumor. What options should I consider for filtering out some samples?
Your data normalisation should be conducted on the entire cohort, i.e., all samples. By, thereafter, including certain categorical variables into your DESeq2 design model, you can then derive p-values for comparisons between different sub-groups.
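To illustrate the idea, here is a rough sketch with simulated counts, assuming DESeq2 is installed. The `group` column with levels `normal`, `tumorA` and `tumorB` is a hypothetical sub-group annotation invented for the example. In practice it should ideally come from independent information (e.g. histology or a known molecular sub-type) rather than from clusters derived from the same expression data.

```r
library(DESeq2)

set.seed(42)
n_genes <- 500

# Hypothetical sample sheet: 3 normals plus two tumour sub-groups
coldata <- data.frame(
  group = factor(c(rep("normal", 3), rep("tumorA", 6), rep("tumorB", 6)),
                 levels = c("normal", "tumorA", "tumorB")),
  row.names = paste0("s", 1:15)
)

# Simulated raw counts (genes x samples), each gene with its own mean
gene_mu <- 2^runif(n_genes, 4, 10)
counts <- matrix(
  rnbinom(n_genes * 15, mu = rep(gene_mu, 15), size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), rownames(coldata))
)

# Normalise and fit the model once, on the entire cohort, from raw counts
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ group)
dds <- DESeq(dds)

# Sub-group comparisons are then pulled out as contrasts from the same fit
res_A <- results(dds, contrast = c("group", "tumorA", "normal"))
res_B <- results(dds, contrast = c("group", "tumorB", "normal"))

# The variance-stabilised matrix is derived separately, for clustering
# and heatmaps only, not for the statistical test
vsd_matrix <- assay(varianceStabilizingTransformation(dds, blind = TRUE))
```

With simulated counts like these there is no real signal, so few or no genes should pass an adjusted p-value threshold. The point is only the structure: one normalisation over all samples, several contrasts.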
May I ask what is the ultimate aim of your study? Is the data from the TCGA or is it your own data?
This is my own data. I did the normalisation on the whole cohort, and I'm using edgeR with logFC 1.2 and FDR < 0.05.
Do you think a t-test can be used for differential analysis instead of edgeR or DESeq2?
If you normalise your data via edgeR or DESeq2, then you should be using the statistical tests provided by those tools. This is also what the developers of those tools would tell you.
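As a rough sketch of what "using the tests provided by the tool" looks like in edgeR, here is a quasi-likelihood workflow on simulated counts, assuming edgeR is installed. The sample sizes, the simulated data and the choice of the quasi-likelihood F-test are assumptions for illustration, and the cut-offs at the end are placeholders rather than the exact thresholds mentioned above.

```r
library(edgeR)

set.seed(7)
n_genes <- 1000
group <- factor(c(rep("normal", 3), rep("tumor", 9)),
                levels = c("normal", "tumor"))

# Simulated raw counts (genes x samples)
counts <- matrix(
  rnbinom(n_genes * length(group), mu = 200, size = 10),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", seq_along(group)))
)

y <- DGEList(counts = counts, group = group)
y <- y[filterByExpr(y), , keep.lib.sizes = FALSE]  # drop very low-count genes
y <- calcNormFactors(y)                            # TMM normalisation factors

design <- model.matrix(~ group)    # second coefficient = tumor vs normal
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = 2)   # edgeR's own test, not a t-test

tab <- topTags(qlf, n = Inf)$table
up <- tab[tab$FDR < 0.05 & tab$logFC > 0, ]  # placeholder cut-offs
```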
"data normalisation should be conducted on the entire cohort, i.e., all samples. By, thereafter, including certain categorical variables into your DESeq2 design model, you can then derive p-values for comparisons between different sub-groups."
This brings me to a question. In this case it is tumor vs normal, so it is easy to test how tumor differs from normal and get fold changes, p-values, etc. But what about multiple cell types? Is there a way to run the differential analysis over all of them together? For example, if I have stem cells, progenitors, and mature cells, I would like to know how expression changes at each stage. What I have been doing is comparing each stage with the previous one and looking at the result. This is a safe option that avoids adding ambiguity to the data, but is it the right way to do it? The other approach I follow is to normalize everything in DESeq2 and then try various kinds of clustering.
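One possible way to set this up, sketched with simulated counts and assuming DESeq2 is installed: fit a single model with a three-level `stage` factor, pull out the successive-stage comparisons as contrasts, and optionally use a likelihood ratio test to ask whether a gene changes across any stage. The stage names, replicate numbers and data below are made up for illustration.

```r
library(DESeq2)

set.seed(3)
n_genes <- 500
stage <- factor(rep(c("stem", "progenitor", "mature"), each = 3),
                levels = c("stem", "progenitor", "mature"))
coldata <- data.frame(stage = stage, row.names = paste0("s", 1:9))

# Simulated raw counts (genes x samples), each gene with its own mean
gene_mu <- 2^runif(n_genes, 4, 10)
counts <- matrix(
  rnbinom(n_genes * 9, mu = rep(gene_mu, 9), size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), rownames(coldata))
)

dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ stage)

# One fit over all samples, then each successive comparison as a contrast
dds_wald <- DESeq(dds)
res_prog_vs_stem <- results(dds_wald,
                            contrast = c("stage", "progenitor", "stem"))
res_mature_vs_prog <- results(dds_wald,
                              contrast = c("stage", "mature", "progenitor"))

# Likelihood ratio test: does stage explain expression at all for a gene?
dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ 1)
res_any_stage <- results(dds_lrt)
```

Compared with subsetting the data for each pair of stages, the single fit estimates dispersions from all samples. The LRT p-values answer a different question (any change across stages) from the pairwise contrasts.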