Hello everybody,
I am fairly new to the RNA-seq workflow and I am currently struggling with how to evaluate the performance of the RNA-seq pipeline I am trying to establish, which will be used to investigate differential gene expression.
Let's say I have 3 different pipelines:
1. kallisto -> tximport -> DESeq2 -> 160 differentially expressed genes (random number)
2. salmon -> tximport -> DESeq2 -> 173 differentially expressed genes
3. STAR -> featureCounts -> DESeq2 -> 184 differentially expressed genes
The problem is that we do not know the "ground truth", i.e. which genes really are differentially expressed. How do I know which pipeline is performing best? Are there any parameters I have to look out for? Furthermore, there are plenty of options within DESeq2 that influence the number of genes considered differentially expressed, e.g. the method used for log fold change shrinkage (apeglm vs. ashr) or the filter function (IHW vs. the default independent filtering).
How do I determine which options to choose?
Thank you very much, any help is appreciated!
What's the overlap between those gene lists? Look for some benchmarking papers. From your question, I gather that you need to spend some time reading about RNA-seq analysis methods; after that you will have a better intuition for what works well.
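To make the overlap question concrete, here is a minimal sketch that compares the lists of significant genes from each pipeline. The gene names are placeholders and the helper functions are my own; it also assumes the gene IDs are directly comparable across pipelines (same annotation, same ID format), which you should check first.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare_pipelines(de_sets):
    """Return pairwise overlap statistics and the genes called by every pipeline."""
    report = {}
    for (name_a, genes_a), (name_b, genes_b) in combinations(de_sets.items(), 2):
        report[(name_a, name_b)] = {
            "shared": len(genes_a & genes_b),
            "jaccard": jaccard(genes_a, genes_b),
        }
    core = set.intersection(*de_sets.values())
    return report, core


# Placeholder gene lists; in practice, load the significant gene IDs
# exported from each DESeq2 run.
de_sets = {
    "kallisto": {"geneA", "geneB", "geneC", "geneD"},
    "salmon": {"geneA", "geneB", "geneC", "geneE"},
    "star": {"geneA", "geneB", "geneF"},
}

report, core = compare_pipelines(de_sets)
for pair, stats in report.items():
    print(pair, stats)
print("called by all pipelines:", sorted(core))
```

The genes called by all pipelines are the ones you can be most confident in; the ones unique to a single pipeline deserve a closer look.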
My personal opinion is that, ideally, the biological conclusions shouldn't change when you change the pipeline (assuming, as in this case, that all of them are standard, widely used pipelines). There will be some differences in the differentially expressed genes, but if you run, for example, a pathway or GO enrichment analysis, all pipelines should show similar enrichments. If you see huge differences between pipelines, the data is not robust enough.
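As a rough illustration of that robustness check, the sketch below runs a simple over-representation test (upper-tail hypergeometric) for each pipeline and compares which pathways come out enriched. Everything here is a toy assumption: the gene names, the pathway definitions, the background set, and the 0.05 cutoff. It also skips multiple-testing correction, which a real enrichment tool would apply.

```python
from math import comb


def enrichment_pvalue(n_background, n_in_pathway, n_de, n_overlap):
    """Upper-tail hypergeometric p-value: P(overlap >= n_overlap)."""
    total = comb(n_background, n_de)
    upper = min(n_in_pathway, n_de)
    tail = sum(
        comb(n_in_pathway, i) * comb(n_background - n_in_pathway, n_de - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


def enriched_pathways(de_genes, pathways, background, alpha=0.05):
    """Names of pathways over-represented among the DE genes (uncorrected p < alpha)."""
    de = de_genes & background
    hits = set()
    for name, members in pathways.items():
        members = members & background
        p = enrichment_pvalue(len(background), len(members), len(de), len(de & members))
        if p < alpha:
            hits.add(name)
    return hits


# Toy data: 20 background genes and two hypothetical pathways.
background = {f"g{i}" for i in range(20)}
pathways = {
    "pathway_x": {"g0", "g1", "g2", "g3", "g4"},
    "pathway_y": {"g10", "g11", "g12", "g13", "g14"},
}

# Two pipelines whose DE gene lists differ but hit the same pathway.
de_pipeline_a = {"g0", "g1", "g2", "g3", "g15"}
de_pipeline_b = {"g0", "g1", "g2", "g4", "g16"}

enriched_a = enriched_pathways(de_pipeline_a, pathways, background)
enriched_b = enriched_pathways(de_pipeline_b, pathways, background)
print("pipeline A:", enriched_a)
print("pipeline B:", enriched_b)
print("same enrichments:", enriched_a == enriched_b)
```

In this toy case the two gene lists are not identical, yet both point to the same pathway, which is the kind of agreement you are hoping for.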
If you are interested in a particular gene, and that gene is differentially expressed with only one method, you should inspect the data more closely and find out why it is not called by the other methods. It could be marginally significant in the other methods (i.e. just above your significance cutoff). In that case you can go ahead with the method you trust most and validate the result further with other experimental approaches, other computational evidence, or the literature.
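For a single gene of interest, one way to do that inspection is to put its adjusted p-values from all pipelines side by side. The sketch below is only an illustration: the values are made up, and the cutoffs (0.05 for significant, 0.10 for marginal) are assumptions you should replace with your own. A missing value stands for an NA adjusted p-value, which DESeq2 reports for genes removed by independent filtering, for example.

```python
def gene_status(padj_by_pipeline, alpha=0.05, marginal=0.10):
    """Classify one gene's adjusted p-value in each pipeline."""
    status = {}
    for pipeline, padj in padj_by_pipeline.items():
        if padj is None:
            # No adjusted p-value available (NA in the results table).
            status[pipeline] = "not tested"
        elif padj < alpha:
            status[pipeline] = "significant"
        elif padj < marginal:
            status[pipeline] = "marginal"
        else:
            status[pipeline] = "not significant"
    return status


# Made-up adjusted p-values for one gene of interest.
padj_for_gene = {"kallisto": 0.03, "salmon": 0.06, "star": None}

status = gene_status(padj_for_gene)
for pipeline, call in status.items():
    print(f"{pipeline}: {call}")
```

A gene that is "significant" in one pipeline and "marginal" in the others is a much weaker disagreement than one that is clearly not significant elsewhere, and "not tested" usually points to low counts rather than a real contradiction.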