I have a very basic question. In many papers and analyses, we see the analysis restricted to genes that pass a threshold such as RPKM/FPKM > 1, 3, or 5. What is this threshold? What does it mean, and how do you calculate it? I'm having trouble understanding this and finding papers/articles that explain it. Any help is appreciated.
I think this is a really good question that wet lab biologists care about: what threshold of RNA count (in any normalized form) could lead to detectable protein expression (by western blot or flow cytometry)?
The threshold itself is pretty arbitrary and should be based off of your own data. In general, what people are trying to do with this is to look at only "expressed" genes, for some hopefully reasonable meaning of expressed.
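As a rough illustration of what such a filter does, here is a minimal Python sketch. The function name `expressed_genes`, the gene names, and the FPKM values are all made up for the example, and the default cutoff of 1 is just one of the commonly seen choices, not a recommendation.

```python
def expressed_genes(fpkm_by_gene, cutoff=1.0):
    """Return only the genes whose FPKM is strictly above the cutoff."""
    return {gene: value for gene, value in fpkm_by_gene.items() if value > cutoff}


# Made-up values, for illustration only.
example_fpkm = {"geneA": 0.2, "geneB": 1.0, "geneC": 3.7, "geneD": 12.5}

kept = expressed_genes(example_fpkm, cutoff=1.0)
print(sorted(kept))  # ['geneC', 'geneD']
```

Raising or lowering `cutoff` changes how many genes count as "expressed", which is why it is worth looking at the distribution of values in your own data before settling on a number.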
RPKM/FPKM is computed as follows:
"number of reads" / "length of gene or region in kb" / (total reads in millions)
For paired-end data, substitute "number of fragments" for reads. You can also get these values from a number of programs, such as StringTie (I think RSEM produces them too, but don't quote me on that).
It is important to remember, though, that because of the way these units are derived, the values are not comparable across samples.
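The formula above can be sketched in a few lines of Python. The function name `rpkm` and the numbers in the example are hypothetical; for paired-end data you would pass fragment counts instead of read counts and call the result FPKM.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """RPKM = reads / (gene length in kb) / (total mapped reads in millions)."""
    length_kb = gene_length_bp / 1_000
    total_millions = total_mapped_reads / 1_000_000
    return read_count / length_kb / total_millions


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 20 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=20_000_000)
print(value)  # 12.5
```

Here 500 reads over 2 kb gives 250 reads per kb, and dividing by 20 (million reads) gives 12.5.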
To derive RPKM/FPKM expression units, samples are only normalised 'within themselves' - there is no cross-sample normalisation. Thus, due to external factors for which this normalisation method does not control, a value of 10 in one sample is not the same as 10 in another. For this reason, in addition, these units are not suitable for differential expression analysis and you should abandon their usage if your aim is to conduct differential expression.
As mentioned, the purpose is to set a cutoff for what is considered 'expressed'. This is also where TPM (transcripts per million) started becoming more popular than RPKM/FPKM, since the aim is to quantify expression of complete transcripts. What counts as a good cutoff is debated among analysis groups. The Sequencing Quality Control (SEQC) consortium (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4810084/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321899/) is an FDA-led group that was put together because pharmaceutical companies were submitting RNA-Seq results rather than microarray data as proof of expression. This group did a fairly good assessment of the consistency of RNA-Seq data and of relative cutoffs. They reported that expression as low as 1 FPKM was verifiable by RT-PCR. It is also well known that variability in RNA-Seq data increases greatly at lower expression levels.
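Since TPM comes up here, this is a small sketch of how TPM relates to FPKM within a single sample: each FPKM value is rescaled so that the values in the sample sum to one million. The function name `fpkm_to_tpm` and the input values are invented for the example, and it assumes the FPKM values for every gene or transcript in the sample are present.

```python
def fpkm_to_tpm(fpkm_by_gene):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm_by_gene.values())
    if total == 0:
        raise ValueError("All FPKM values are zero; TPM is undefined.")
    return {gene: value * 1_000_000 / total for gene, value in fpkm_by_gene.items()}


# Made-up FPKM values for a toy sample with only three genes.
sample_fpkm = {"geneA": 2.0, "geneB": 6.0, "geneC": 12.0}

sample_tpm = fpkm_to_tpm(sample_fpkm)
print(sample_tpm["geneA"])       # 100000.0
print(sum(sample_tpm.values()))  # 1000000.0
```

Because the rescaling uses the sample's own total, a cutoff expressed in TPM is not numerically the same as a cutoff expressed in FPKM.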
See for example this blog post: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
For a nice explanation, also see StatQuest