Hi,
I have RNA seq data with the samples defined as follows:
ID level stability type
A1 med unstable exp
A2 med unstable exp
A3 med unstable exp
B1 med stable exp
B2 med stable exp
B3 med stable exp
C1 low stable exp
C2 low stable exp
C3 low stable exp
D1 low unstable exp
D2 low unstable exp
D3 low unstable exp
E1 high stable exp
E2 high stable exp
E3 high stable exp
F1 high unstable exp
F2 high unstable exp
F3 high unstable exp
G1 host host host1
G2 host host host1
G3 host host host1
H1 host host host2
H2 host host host2
H3 host host host2
I1 host host host3
I2 host host host3
I3 host host host3
ID column here describes the samples in 3 biological replicates. My objective is to identify DEGs while comparing different samples in pair (A vs B, B vs C, A vs D etc) as well as in group-wise comparison (stable vs unstable, host vs expressed). But when I define a contrast such as "contrast = c("stability","unstable","stable")", I get some DEGs with just one of the replicates from 3 different stable or unstable samples being high or low in comparison to others as well.
As I understand that's because when the program identifies a gene in at least 3 samples within a group of the same profile (irrespective of being just 1 of the replicates of 3 different samples) with significant difference from the rest, it reports it as a DEG. However, I would like to know if there is some parameter that can be introduced to say that more than 3 samples in a group of 9 (may be at least 6 samples in our case) are required to have concordance to be reported as DEG.
If not, then can someone kindly suggest some other way to do avoid getting DEGs not representative of entire group but just 3 samples of a big group.
Thanks !!
How did you check that?
When I plot heatmap for significantly differentially expressed genes like this:
Plotting heatmaps on non-log2 transformed counts is not recommended as high counts will dominate the scaling. Do at least
log2(x+1)
, better z-score the data. This also appears to be an eyeballing method, therefore not recommended as well as not reproducible as it lacks exact cutoffs. When you perform DESeq2 (or other tools) analysis, the significant genes are representative for the group as the group mean is significantly different than the group mean of the other group respecting the dispersion of the data. I think you worry too much. I would perform standard analysis and then take the genes significant at e.g. 5% or 1% FDR. These tools have elaborate ways to ensure that single outliers do not skew the results especially at these low sample sizes.I tried plotting z-scores after log normalisation as well. But I still get some such results. However, may be I will trust these as it is and go ahead with same result.
Thanks for your response.
Do I understand correctely that you are doing stable vs unstable, but when you look at the DEGs you see genes that are, for example, consistently high in B1, B2 and B3, but not in C1, C2, C3,E1,E2,E3?
This is not how DESeq(2)/limma/edgeR work. They will take all the samples that fall within one of your groups (say stable) and calculate the mean expression level. They then calculate a variance on that mean (this where the DESeq/edgeR/limma magic happens). They then ask if the mean of your two categories is sufficiently different compared to the variance using a negative binomial model.