I've analyzed my RNA-seq data using DESeq2 and now I am trying to see whether some gene of interest are expressed in our dataset. I found this paper which uses RPKM cut off of 0.3 to defined genes as expressed. I found out that I can use the fpkm function in DESeq2 to convert the counts to fpkm/rpkm. However, reading some posts in here, it seems that rlog or vst normalized counts are recommended instead. Are there any common rule of thumb for the rlog or vst-normalized counts to define expressed genes? Thanks in advance!
it seems that rlog or vst normalized counts are recommended instead
These are recommended when you need to compare gene expression between conditions. To see whether a gene is expressed within a single condition, RPKM/FPKM are fine IMHO.
Why don't you convert them to TPM and use a global threshold? I noticed that "When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample.", when reading this blog post.
This is a very "philosophical" question, but to cut it short, many people in the past have used a minimum threshold of FPKM >= 1. This value is extremely eye-metric and not based on robust statistics.
It will, however, eliminate a lot of noise, and it will be easy to reference why you chose that value (e.g. "because many others did (ref1, ref2, ref3, ..., refn).
My rule of thumb is FPKM >= 1, that being said I always generate a density plot of the FPKMs to view the distribution - in R I run plot(density(log2(fpkm))). This (almost always) generates a bimodal distribution and I've found that 1 is very close to the trough of that distribution. However, there are exceptions and in those cases I use the empirical cutoff, which I can defend using that density plot.
The thing about RPKM is it does't mean anything outside of the context of its samples. One sample could have very low read count, and an RPKM of 1 might mean only 1 read, while another sample could have an RPKM of 1, and that represents a hundred reads. 1 read for a gene is noise, 100 reads is not. An RPKM threshold of 0.3 might work for them with their coverage, but that might be very wrong for your samples if your coverage is very different.
TPM is going to have the same problem, it's going to flatten out how many raw counts you actually have for that sample and gene.
Also note that that paper is 10 years old, only published about 2 years after Illumina started putting sequencers in the field. They aligned with Bowtie! The original DESeq paper wasn't published for another year.
rlogs and vsts stay closer to the original raw count. If a sample has a vst corrected value of 5 for a gene, that doesn't mean there are 100 reads. It means there are less than 10.
These are recommended when you need to compare gene expression between conditions. To see whether a gene is expressed within a single condition, RPKM/FPKM are fine IMHO.
Why don't you convert them to TPM and use a global threshold? I noticed that "When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample.", when reading this blog post.
Previous discussion about defining expression cutoffs: TPM values of expressed genes