When using RNA-Seq data for differentially expressed gene (DEG) analysis, must the two groups have the same number of samples (replicates)? For example, if I have 8 samples from the control group and 5 samples from the treatment group, is it OK to use DESeq for the DEG analysis?
I'm using HISAT2 and featureCounts to produce the count files. Before putting them into DESeq, should I normalise them first, or can I use them directly?
I am assuming you are using DESeq2 and not DESeq.
As far as the number of samples in each group is concerned, it is
perfectly fine to perform DE analysis with unequal groups. The ideal
scenario is equal numbers of paired samples, in which case you should
account for the pairing when performing DE analysis with any standard
tool such as DESeq2, edgeR or limma.
DESeq2 can still perform DE analysis with just 2 samples in one group
and 3 in the other. I would treat that as the lower limit; below that,
the results are usually not trustworthy.
I have read that edgeR can work with even fewer samples per group, but
I do not trust such analyses, to be honest. Your number of samples per
group is quite adequate for the tests.
About normalization: the DE tools I mentioned, including the one in
your question, work on raw count data, so there is no point in feeding
them normalized data. They perform normalization themselves as part of
the analysis. Just prepare your count table carefully, follow the
DESeq2 tutorial, and you are good to go.
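To make the point about internal normalization concrete, here is a rough Python sketch of the median-of-ratios idea that DESeq2 uses by default to estimate per-sample size factors. It is a simplified illustration, not DESeq2's actual code; the function name `size_factors` and the toy counts are made up for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample with a median-of-ratios scheme.

    counts: list of gene rows, each row a list of raw counts (one per sample).
    Simplified sketch only; names and data here are hypothetical.
    """
    n_samples = len(counts[0])
    # Keep only genes with a nonzero count in every sample, so logs are defined.
    log_rows = [[math.log(c) for c in row]
                for row in counts if all(c > 0 for c in row)]
    if not log_rows:
        raise ValueError("no gene has nonzero counts in all samples")
    # Log of the geometric mean of each gene across samples.
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [row[j] - g for row, g in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # skipped because of the zero count
]
sf = size_factors(toy_counts)
# Dividing raw counts by the size factors puts samples on a common scale.
normalized = [[c / f for c, f in zip(row, sf)] for row in toy_counts]
```

Because the size factors are estimated from the raw counts themselves, feeding already-normalized values into such a tool would distort this step, which is why the raw count table is the expected input.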
I would advise working through the tutorial carefully before performing
any DE analysis. It is always good to understand how the data behave,
not only as a QC step but also as good practice in exploratory data
analysis. This helps you understand why you need the DE analysis and
which samples should be included in it. You may need to remove 1 or 2
samples from either group if they behave as outliers, owing to batch
effects or sequencing errors, even after applying batch correction
methods. So a complete workflow is advised; it also lets you build a
discovery pipeline that you may reuse often in your lab. I hope this
answers your query.
It is always good to have the same number of samples in both comparison groups, but in some cases that is difficult to achieve. In my opinion, it is completely fine to do differential expression analysis with unequal numbers of samples in the comparison groups.
HISAT2 -> featureCounts is a good choice of tools if you are doing gene-level differential expression (DE) analysis. You do not have to perform normalization yourself, as most DE tools expect raw counts rather than normalized values.
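As a hedged illustration of preparing the raw count table, the Python sketch below reads a featureCounts-style tab-separated table and keeps only the gene IDs and the per-sample integer counts. It assumes the usual layout of a leading `#` comment line followed by six annotation columns (Geneid, Chr, Start, End, Strand, Length) and then one count column per BAM file; the function name `read_featurecounts` and the sample names are made up for this example.

```python
import csv
import io

ANNOTATION_COLUMNS = 6  # Geneid, Chr, Start, End, Strand, Length (assumed layout)


def read_featurecounts(handle):
    """Return (sample_names, {gene_id: [raw counts]}) from a featureCounts-style table."""
    # Skip comment lines such as the command-line record at the top of the file.
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    samples = header[ANNOTATION_COLUMNS:]
    counts = {}
    for row in rows:
        if not row:
            continue  # ignore blank lines
        # Raw integer counts, left un-normalized for the DE tool.
        counts[row[0]] = [int(value) for value in row[ANNOTATION_COLUMNS:]]
    return samples, counts


# Hypothetical two-sample table built in memory for demonstration.
example = "\n".join([
    "# comment line written by featureCounts (content elided)",
    "\t".join(["Geneid", "Chr", "Start", "End", "Strand", "Length",
               "ctrl_1.bam", "treat_1.bam"]),
    "\t".join(["geneA", "chr1", "100", "900", "+", "801", "12", "30"]),
    "\t".join(["geneB", "chr1", "2000", "2600", "-", "601", "0", "7"]),
])
samples, counts = read_featurecounts(io.StringIO(example))
```

The resulting gene-by-sample matrix of integers is the kind of input that count-based DE tools expect, with no normalization applied beforehand.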