The paper I refer to is: https://doi.org/10.1038/s41467-018-08083-z
A couple of snippets from the results & methods section:
"We first adopt a highly stringent pair-wise comparison method5, in which a gene with a much higher RPKM (reads per kilobase million) value in one domain than all the other 8 domains (log2FC > 1, false discovery rate (FDR) < 0.05) is considered as domain specific."
"Reads were mapped to the Arabidopsis Information Resource TAIR10 reference genome build with TopHat2 (version 2.0.9) and BOWTIE (version 2.1.0) allowing up to two mismatches68 after filtering the low-quality reads (PHRED quality score < 20). The gene locus expression levels were calculated based on mapping outputs after removing reads mapped to ribosomal RNAs and transfer RNAs using Cuffdiff2 (version 2.1.1)69, and expression levels were normalized to the RPKM unit using edgeR with significant expression cutoff value set to RPKM > 1. Differential expression was assessed with edgeR and the cutoff value was >2-fold change in expression with Benjamini–Hochberg adjusted FDR < 0.01. We used Cuffdiff2 to quantify the abundance of annotated isoforms. For the identification of domain-specifically expressed genes, three methods were performed independently. Pair-wise comparison was carried out using the differential expression assessed with edgeR."
I will address my queries as parts:
1) As far as I understand, edgeR uses TMM to normalise across libraries and fits the normalised raw counts to a NB distribution (I hope I got that part right) before testing for significance across samples and/or treatment. From the methods section in the above paper, all I gather is, the authors have somehow integrated the use of RPKM along with the edgeR package. Could someone clarify this for me?
2) And as far as I have been able to tell from the plethora of posts here and elsewhere, there doesn't seem to be any situation where FPKM/RPKM are to be used for sound qualitative/quantitative analysis of RNA seq data. Am I wrong to say this? Or are there situations where converting to FPKM units would give us some that edgeR and deseq2 don't?
3) Is this a suitable extension of what the authors of edgeR originally intended the package to be used for? I ask because I am in the middle of some analysis myself and this paper serves as one reference point for analysis.
4) Is it proper to use FPKM after normalising to library sizes (as done in edgeR)? I refer to What the FPKM? A review of RNA-Seq expression units (https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/)
The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments.
.... 4) does this imply that post between-sample normalisation, one could possibly utilise FPKM values for analysis? This seems to me what the authors might have done in the paper.
Out of interest, how did you fit this curve? I mean just a histogram and by-eye cutoff or something automated like mclustBIC?)
I simply run
plot(density(log10(fpkm)))
in R. In my experience a bimodal curve is generated with a trough right near 1 FPKM. I always use this as a sanity check, I do have several datasets where the trough appears new 2 or 3 FPKM and adjust my filtering criteria accordingly.I've gone through the exercise once or twice to get an exact minimum value, but it's so close to the integer value that I round (it also makes the methods section a bit more intuitive than using a cutoff of 1.034 FPKM).