RNA-seq-tutorial-for-gene-differential-expression-analysis
This tutorial is based on Bioconductor packages, RNAseq gene differential expression analysis, including filtering, preprocessing, visualization, clustering, and Enrichment. I hope it will be useful for new Bioconductor users.
Required data files You should have a raw count and annotation/metadata file for running this analysis (In this tutorial example files are provided). Raw count files are usually obtained from tools such as featureCount, RSem, etc from Bam files.
Bioconductor packages to be installed
DESeq2
edgeR
biomaRt (Useful for gene filtering and annotations)
PCAtools (PCA detailed analysis)
ReactomePA (enrichment analysis)
Note: PCA and Enrichment analysis is based on Deseq2. However, users may be interested in considering only those genes that are commonly differentially expressed between DEseq2 and EdgeR.
R script and further instructions of the tutorial are available here.
https://github.com/amarinderthind/RNA-seq-tutorial-for-gene-differential-expression-analysis
We explained various bulk RNA-Seq applications in our recent review published in Briefing in Bioinformatics. https://doi.org/10.1093/bib/bbab259
Thank you for putting this together. I have couple of things as feedback that I disagree with though:
Here you use naive CPM calculation instead of TMM (you need to run
cpm()
) on a edgeR DGEList object for whichcalcNormFactors()
has already been run, and CPMs that are used for PCA should be on the log2 scale. One typically also subsets the genes for those with large variances as a proxy for being different between samples, say the top 500 most variable genes on the log scale. That would then be consistent with this line because the DESeq2 PCA function by default uses the top 500 most variable genes, and since you use vst the data are already on the log scale. This'd make things consistent.Here I'd rather run the recommended
FilterByExpr()
filter from edgeR rather than custom filtering to remove low counts as you did on top of the script.All these three commands (the outputs) are included when running
estimateDisp
afaik.The current recommendations of the edgeR authots (based on various Bioconductor posts) is to use the QLF framework rather than the LRT, referring to here.
As said for this here, I'd rather apply
FilterByExpr
than custom approaches.In general when writing code that is intended for a tutorial you might want to check out Rmarkdown rather than a plain R script, it is really handy and the produced html reports are awesome both for display and also for code documentation.
I gave my two cents on what is my best practice for standard RNA-seq here: Basic normalization, batch correction and visualization of RNA-seq data
Thank you very much for the feedback. Yes, there were corrections on
PCAtools
related lines. After verification of estimateDisp from edgeR documents, I may includeFilterByExpr
function to edgeR related part and again I need to check what's the best way for deseq2 in that case.The DESeq2 manual explicitely states that no prefiltering is necessary. The independent filtering of the results function will take care of that internally.
I see QLF has advantages over LRT, with some exceptions. Mentioned by Aaron Lun here..... where the dispersions are very large and the counts are very small, whereby some of the approximations in the QL framework seem to fail.