I'm working with RNA-seq data from breast cancer samples. There are 40 samples in total: 26 are subtype A and 14 are subtype B.
I ran a differential expression analysis between subtype A and subtype B with edgeR. Genes were called differentially expressed at FDR < 0.05.
The heat map looks like below:
Column annotation colors -
Orange color is Subtype A
Darkgreen color is Subtype B
Among the 26 subtype A samples, 9 cluster together but sit away from the other 17. You can see that clearly in the heat map.
I also made an MDS plot. In the MDS plot below I circled the 9 subtype A samples that lie close to the subtype B samples.
What should I do now, given that the differential analysis heatmap looks like the one above?
Is removing those 9 samples from the analysis, based only on clustering, a good idea? If not, any suggestions would be appreciated.
There may be a batch effect. Were these samples sequenced on the same sequencing run? Were the RNA libraries prepared at the same time? Same library kit? Same RNA extraction method?
Cancer subtypes can be a difficult thing. Many cancer types can be broken down into quite distinct sub-subtypes. It's also possible that what clinicians have used to assign subtypes is not as clear-cut as they would like. For example, in endometrial cancer, samples can traditionally be typed on two different systems - Type I vs Type II, and endometrioid vs serous. However, we find that the Type I/II classification doesn't make much sense from a molecular point of view.
The question is, what are you trying to gain from this analysis? If you want to know "what are the average gene expression differences between samples with these two different classes", then use the DE and don't worry about the heatmap (I'm not really a fan of using heatmaps just because you need to have a figure of some sort).
On the other hand, if you are interested in discovering the heterogeneity underlying cancer, of which the current subtyping schemes are but one example, then I would start with some sort of clustering, identify clusters, and then identify the genes driving the clusters (either from the gene dendrogram, or by doing DE between the de novo identified clusters). You might then find that one of the clusters corresponds to a traditional "sub-class".
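To make the "cluster first, then look at the labels" idea concrete, here is a minimal sketch in plain Python. It is not what any particular package does: it is a small stand-in for hierarchical clustering with average linkage on a 1 - Pearson correlation distance, stopped when k clusters remain. The sample names, the expression values and the choice of k = 2 are all made up for illustration; in practice you would more likely run something like hclust on normalised log-expression in R.

```python
from itertools import combinations
from math import sqrt

def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def average_linkage_clusters(profiles, k):
    """Repeatedly merge the two closest clusters until only k remain."""
    names = list(profiles)
    # Pairwise sample-to-sample distances, computed once.
    dist = {frozenset(pair): pearson_distance(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(names, 2)}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster sample pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical log-expression of 4 genes in 5 samples (illustrative values only).
# "A3" is annotated as subtype A but has a B-like profile.
profiles = {
    "A1": [8.0, 7.5, 2.0, 1.5],
    "A2": [7.8, 7.9, 2.2, 1.0],
    "A3": [2.1, 1.8, 7.9, 8.2],
    "B1": [2.0, 1.5, 8.1, 7.7],
    "B2": [1.7, 2.2, 7.6, 8.0],
}

clusters = average_linkage_clusters(profiles, k=2)
for number, members in enumerate(clusters, start=1):
    print(f"cluster {number}: {', '.join(members)}")
```

The clusters are found without looking at the subtype labels at all; comparing them with the annotation afterwards tells you whether the traditional classes line up with the structure in the data.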
Thanks for the answer. I have a related question. What should I say if the DE heatmap looks like the one above, where 9 samples of subtype A are away from the main cluster?
There is nothing wrong with the DE list - it is still the genes that are on average different between subtype A and subtype B; it's just that subtype A and subtype B might not be the most useful grouping.
You could check that the subtype annotation is correct. For example, if this were receptor-positive vs triple-negative breast cancer, you could look at the expression of the hormone receptors in each of the samples to ensure that receptor-positive samples haven't been annotated as triple negative, or vice versa.
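A rough sketch of that kind of annotation check, in plain Python. Everything here is hypothetical: the sample IDs, the log2-CPM values, the annotation labels and the cutoff of 5.0 are invented for illustration, and ESR1 (the estrogen receptor gene) is just one marker you might look at. A real cutoff should come from the distribution of the marker in your own data.

```python
# Hypothetical per-sample records: annotation plus ESR1 expression (log2 CPM).
samples = [
    {"id": "S1", "annotated": "receptor_positive", "ESR1": 9.1},
    {"id": "S2", "annotated": "receptor_positive", "ESR1": 8.4},
    {"id": "S3", "annotated": "receptor_positive", "ESR1": 1.2},  # looks negative
    {"id": "S4", "annotated": "triple_negative", "ESR1": 0.8},
    {"id": "S5", "annotated": "triple_negative", "ESR1": 7.9},    # looks positive
]

# Assumed cutoff separating "expressed" from "not expressed"; purely illustrative.
THRESHOLD = 5.0

def flag_mismatches(samples, gene, threshold):
    """Return IDs of samples whose marker expression disagrees with the annotation."""
    flagged = []
    for sample in samples:
        expressed = sample[gene] >= threshold
        annotated_positive = sample["annotated"] == "receptor_positive"
        if expressed != annotated_positive:
            flagged.append(sample["id"])
    return flagged

flagged = flag_mismatches(samples, "ESR1", THRESHOLD)
print("samples to double-check:", flagged)
```

Flagged samples are candidates for a closer look at the clinical records, not proof of a mislabel.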
Your heatmap is difficult to interpret, but it seems to be fine.
As mentioned by i.sudbery, you certainly have 2 sub-subtypes within subtype A.
A good way to verify this is to display the column dendrogram and split the columns into 3 or 4 clusters.
The order of the clusters is arbitrary (you can swap them); what you mainly need to verify is whether each cluster is composed of a single subtype.
Below is an example built with R and pheatmap, with batch and subgroup tracks on the columns. We can see 4 subgroups, with several sub-subgroups highlighted by 11 clusters.
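Once the columns are split, checking the composition of each cluster is just a cross-tabulation. A minimal sketch in plain Python, assuming you have exported the cluster label of each sample (for example from cutting the column dendrogram) and its subtype annotation; the sample names and labels below are made up.

```python
from collections import Counter

# Hypothetical inputs: cluster label per sample and subtype annotation per sample.
cluster_of = {"s1": 1, "s2": 1, "s3": 1, "s4": 2, "s5": 2, "s6": 3, "s7": 3, "s8": 3}
subtype_of = {"s1": "A", "s2": "A", "s3": "A", "s4": "A", "s5": "A",
              "s6": "B", "s7": "B", "s8": "A"}

def crosstab(cluster_of, subtype_of):
    """Count samples for every (cluster, subtype) combination."""
    return Counter((cluster_of[s], subtype_of[s]) for s in cluster_of)

def mixed_clusters(table):
    """Return the clusters that contain more than one subtype."""
    subtypes_per_cluster = {}
    for (cluster, subtype), _count in table.items():
        subtypes_per_cluster.setdefault(cluster, set()).add(subtype)
    return sorted(c for c, subs in subtypes_per_cluster.items() if len(subs) > 1)

table = crosstab(cluster_of, subtype_of)
for (cluster, subtype), count in sorted(table.items()):
    print(f"cluster {cluster}  subtype {subtype}  n={count}")
print("mixed clusters:", mixed_clusters(table))
```

The same tabulation works against a batch label (sequencing run, library prep date) instead of the subtype, which is a quick way to see whether a cluster follows batch rather than biology.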