How Do You Justify Your RNA-Seq Expression Threshold (FPKM/RPKM)?
11.9 years ago
biorepine ★ 1.5k

Hi, after following 4 years of RNA-Seq literature, I have come to understand that most papers define an expression threshold arbitrarily, e.g., >1 FPKM/RPKM, to call a transcript expressed. But how can one really justify this?

rpkm rna-seq cutoff fpkm • 42k views
11.9 years ago

Our lab uses spike-ins of known RNA sequences, all at known concentrations. If the spike-in RPKM values make sense given those concentrations, you have some evidence that the RPKM values for your own transcripts at similar levels are accurate.

We use the Ambion ERCC spike-in controls.
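
One way to check whether the spike-in RPKMs "make sense" is to fit a straight line to log10(RPKM) against log10(known concentration) and look at the slope and R². This is only a minimal sketch: the function name `fit_spike_ins` and the numbers in the example are made up for illustration, and undetected spike-ins (RPKM of 0) are simply dropped before fitting.

```python
import math

def fit_spike_ins(known_conc, observed_rpkm):
    """Least-squares fit of log10(RPKM) on log10(concentration).

    Returns (slope, intercept, r_squared). Spike-ins with a zero
    concentration or zero RPKM are skipped because log10 is undefined.
    """
    points = [(math.log10(c), math.log10(r))
              for c, r in zip(known_conc, observed_rpkm)
              if c > 0 and r > 0]
    n = len(points)
    if n < 3:
        raise ValueError("need at least 3 detected spike-ins")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("spike-in concentrations must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared

# Hypothetical values, for illustration only.
conc = [0.01, 0.1, 1.0, 10.0, 100.0]
rpkm = [0.0, 0.3, 2.1, 19.0, 210.0]
slope, intercept, r2 = fit_spike_ins(conc, rpkm)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.3f}")
```

A slope near 1 with a high R² suggests RPKM tracks input concentration over the fitted range. The lowest concentration at which spike-ins are still reliably detected gives a rough idea of where the measurements stop being trustworthy.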


I think using spike-in controls is key going forward with RNA-Seq experiments. Personally, I was quite irritated by the absurdly low cutoffs ENCODE has been using for calling "novel" RNAs, levels that frankly reflect noise picked up by the depth of sequencing.


I can't agree more. However, using >1 RPKM in discovering long non-coding RNAs should be fine as they are expected to be lowly expressed.


It depends on what you expect >1 RPKM to work out to in terms of the expected number of transcripts per cell.
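
For a rough back-of-the-envelope conversion: if reads sample transcripts in proportion to abundance times length, a gene's share of all transcript molecules is its RPKM divided by the sum of RPKMs over all genes. Multiplying by the number of mRNA molecules per cell gives an estimated copy number. The sketch below assumes you can supply that per-cell number yourself. The 200,000 in the example is a placeholder, not a measured value, and the function name is hypothetical.

```python
def transcripts_per_cell(gene_rpkm, total_rpkm, mrna_per_cell):
    """Estimate copies per cell from a gene's RPKM.

    gene_rpkm     -- RPKM of the gene of interest
    total_rpkm    -- sum of RPKM over all genes in the same sample
    mrna_per_cell -- ASSUMED total number of mRNA molecules per cell
    """
    if total_rpkm <= 0:
        raise ValueError("total_rpkm must be positive")
    return gene_rpkm / total_rpkm * mrna_per_cell

# Placeholder numbers, for illustration only.
copies = transcripts_per_cell(gene_rpkm=1.0,
                              total_rpkm=400_000.0,
                              mrna_per_cell=200_000)
print(copies)
```

The answer scales directly with the assumed mRNA content per cell, so the same RPKM cutoff can mean quite different copy numbers in different cell types.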

11.9 years ago
Gabriel R. ★ 2.9k

If I were you, I would make a density plot of the FPKM values you are getting. Hopefully you will see a distinct distribution and a reliable range for your cutoff.


But isn't the way you choose the cutoff after plotting them still kind of arbitrary?


Arbitrary perhaps but at least justifiable.


Based on the density plot, what's your suggestion on where to assign a threshold?


It depends on the distribution. If you get a nice bimodal distribution, pick anything in between the two peaks.
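
As a minimal sketch of the "in between the peaks" idea: bin log2(FPKM + pseudocount), smooth the histogram lightly, take the two tallest local maxima, and use the lowest bin between them as the cutoff. The function names, the pseudocount of 0.01 and the bin width of 0.5 are arbitrary choices for illustration. On noisy data the two tallest maxima may both sit in the same mode, so look at the plot before trusting the number.

```python
import math

def log2_histogram(fpkm, pseudocount=0.01, bin_width=0.5):
    """Histogram of log2(FPKM + pseudocount); returns (centers, counts)."""
    logs = [math.log2(v + pseudocount) for v in fpkm]
    lo = math.floor(min(logs))
    n_bins = int((max(logs) - lo) / bin_width) + 1
    counts = [0] * n_bins
    for x in logs:
        counts[int((x - lo) / bin_width)] += 1
    centers = [lo + (i + 0.5) * bin_width for i in range(n_bins)]
    return centers, counts

def valley_cutoff(centers, counts, smooth=1):
    """Return the bin center (log2 scale) at the lowest point between
    the two tallest peaks, or None if fewer than two peaks are found."""
    n = len(counts)
    # Moving-average smoothing over a window of 2 * smooth + 1 bins.
    sm = []
    for i in range(n):
        window = counts[max(0, i - smooth): i + smooth + 1]
        sm.append(sum(window) / len(window))
    # A peak is higher than its left neighbour and not lower than its right.
    peaks = [i for i in range(n)
             if (i == 0 or sm[i] > sm[i - 1])
             and (i == n - 1 or sm[i] >= sm[i + 1])]
    if len(peaks) < 2:
        return None
    left, right = sorted(sorted(peaks, key=lambda i: sm[i], reverse=True)[:2])
    valley = min(range(left, right + 1), key=lambda i: sm[i])
    return centers[valley]

# Usage, with `fpkm_values` being your own list of per-gene FPKMs:
# centers, counts = log2_histogram(fpkm_values)
# log2_cut = valley_cutoff(centers, counts)
# if log2_cut is not None:
#     print("cutoff (FPKM):", 2 ** log2_cut)
```

The `centers` and `counts` lists are also all you need to draw the distribution itself in any plotting tool.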


Why anything in between? Do you think the expression levels of genes at the lower peak are not trustworthy? I thought these genes are just expressed at low levels.

Thanks.


Which values should I select to make the distribution graph? I have been trying to do this but failed. Please suggest the simplest way, as I am a beginner in this area. Thank you.


We used RSEM to align the reads and quantify expression, and used an estimated gene count of 5 as the threshold: if no sample has a gene count >= 5, that gene is filtered out and not used for downstream analysis. Unless you have a strong reason not to do so, this filtering method should serve you well, as it has done for us.
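
The filter described above can be sketched in a few lines. This assumes you have already collected the per-sample estimated counts into a dictionary keyed by gene. The function name, gene IDs and numbers are hypothetical.

```python
def filter_by_max_count(counts_by_gene, min_count=5.0):
    """Keep a gene if at least one sample has an estimated count >= min_count.

    counts_by_gene maps gene ID -> list of estimated counts, one per sample.
    """
    return {gene: counts
            for gene, counts in counts_by_gene.items()
            if any(c >= min_count for c in counts)}

# Hypothetical counts for three samples.
counts = {
    "geneA": [0.0, 1.2, 3.9],    # never reaches 5 -> dropped
    "geneB": [0.0, 7.5, 2.0],    # one sample >= 5 -> kept
    "geneC": [120.0, 98.3, 110.1],
}
kept = filter_by_max_count(counts)
print(sorted(kept))
```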

11.9 years ago

Although spike-ins, as mentioned, are best, if you don't have them you could look at this paper.

It outlines a procedure for setting a cutoff that is a good compromise between a low false-positive rate and a low false-negative rate. The approach compares the observed distribution of FPKMs for transcripts in the sample with FPKMs calculated for a "negative set" of regions that lie close to annotated genes but haven't been observed to be expressed in any published experiments.
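
To make the trade-off concrete, here is a rough sketch of the general idea, not the paper's actual procedure. It assumes that a negative-set region at or above the cutoff counts as a false positive and an annotated transcript below the cutoff counts as a false negative. Those definitions, the function names and the numbers are my assumptions for illustration, and the paper may define them differently.

```python
def error_rates(cutoff, annotated_fpkm, negative_fpkm):
    """Return (false_positive_rate, false_negative_rate) at a cutoff.

    ASSUMED definitions:
      false positive -- negative-set region with FPKM >= cutoff
      false negative -- annotated transcript with FPKM < cutoff
    """
    fp = sum(v >= cutoff for v in negative_fpkm) / len(negative_fpkm)
    fn = sum(v < cutoff for v in annotated_fpkm) / len(annotated_fpkm)
    return fp, fn

def balanced_cutoff(candidates, annotated_fpkm, negative_fpkm):
    """Pick the candidate cutoff that minimises FP rate + FN rate."""
    return min(candidates,
               key=lambda c: sum(error_rates(c, annotated_fpkm, negative_fpkm)))

# Hypothetical FPKM values.
annotated = [0.1, 2.0, 5.0, 10.0]
negative = [0.0, 0.05, 0.2, 0.5]
for c in [0.01, 0.1, 1.0, 10.0]:
    print(c, error_rates(c, annotated, negative))
print("chosen:", balanced_cutoff([0.01, 0.1, 1.0, 10.0], annotated, negative))
```

Minimising the plain sum weights both error types equally. If one kind of error matters more for your analysis, weight the two rates accordingly.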


Just before posting this question, I came across this paper but I was confused with the way they define false positives/negatives.

11.9 years ago

Using an RPKM of 1 is as arbitrary as using a p-value of 0.05. There are some papers that use intronic/intergenic expression as the baseline threshold. But even that can get complicated and messy.

10.4 years ago
Ann ★ 2.4k

If a read exists in your RNA-Seq data set that aligns uniquely to a gene, doesn't it mean that the original RNA sample contained a transcript from that gene? The only other way to get such a read would be contamination from genomic DNA. And if you observe more than one read aligning to your gene of interest and they are clearly not PCR duplicates, then your confidence that the gene was active in your original sample would increase. However, in practice, it is very hard to work with these very lowly expressed genes. For example, if you try to assay their expression using qPCR, the Cq values may be so large and variable that you can't get an accurate measurement.

On the other hand, if you are doing a more genome-scale analysis, maybe because you are interested in the diversity of genes that are expressed across different sample types (e.g., pollen, roots, leaves, trichomes), then it probably makes sense to apply a cutoff. In that scenario, some libraries might seem to indicate greater diversity of gene expression only because you did more sequencing and there were more chances to observe rare reads arising from less active genes.


I think it can be easy to conflate "expressed" vs "expressed and with noticeable phenotype".

The transcriptional landscape is a stochastic bag of enzymes and molecules. Transcription happens randomly and everywhere. It just so happens that certain places on the genome allow for more transcription. So in terms of expression, 1 tag mapping to a transcript does mean expression, but does it mean it is affecting some kind of phenotype? That is probably what people want to know to gain some kind of biological insight. Depending on the cellular context, maybe 1 transcript is enough to cause some kind of amplification cascade to affect phenotype; or maybe at least 1 billion transcripts are needed. I don't think a global threshold can really be defined for "expression with phenotype".


Exactly. There is a lot of transcriptional noise. We know that random non-gene portions of the genome get transcribed at low levels. You have to establish, at the very least, a baseline threshold for clearing that noise level to even begin to say that something is biologically relevant.


Sorry, I forgot to mention another possible solution or reason to apply a cutoff. You may suspect your sample has some contamination. For example, your method of isolating single cell types might be imperfect. In that case, you could use reads from genes that you expect to be expressed in the contaminating cell type as a way to pick a cutoff. For example, you would not expect photosynthetic genes to be expressed in pollen, and so you use those to calibrate your cutoff. A reviewer of a pollen RNA-Seq paper I wrote suggested this idea. It made good sense to me so I included it in the final version of the paper. So there is at least one example of this idea "working" in a peer review scenario.
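
One possible way to turn this into a number, purely as an illustration: take the FPKMs of the marker genes you believe come from the contaminating cell type and set the cutoff at a high percentile of them, so that most contaminant signal falls at or below it. The percentile, the function name and the values below are my assumptions, not the method used in the paper mentioned above.

```python
import math

def contaminant_cutoff(marker_fpkm, fraction=0.95):
    """Nearest-rank percentile of contaminant marker FPKMs.

    Returns the smallest observed value v such that at least `fraction`
    of the marker genes have FPKM <= v.
    """
    if not marker_fpkm:
        raise ValueError("need at least one marker gene")
    ordered = sorted(marker_fpkm)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]

# Hypothetical FPKMs of photosynthesis genes measured in a pollen sample.
markers = [0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1.2, 3.0]
print(contaminant_cutoff(markers, fraction=0.5))
```

A stricter fraction pushes the cutoff up and removes more of the suspected contamination, at the cost of also dropping more genuinely low-expressed genes.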


Of course, the downside is that you may filter out genes with potentially novel, unknown functions simply because you attribute their reads to contamination.


Even if you do DNase treatment, you will still have some amount of noise from genomic DNA that remains. In addition to genomic DNA contamination, it could also be that the read was misaligned to that region. If the read is within the intron of a gene, it could also be signal from unprocessed RNA. I have observed instances of other contamination from the lab (e.g. genomic DNA or cDNA from another experiment) as well. Finally, as others have mentioned, transcription is a stochastic process. Every base in the genome is transcribed with some probability. Only a subset of this transcription is biologically significant to most researchers doing gene expression assays. If you have RNA-seq reads that span what appears to be a valid splice site, this gives you a bit more confidence, because exon-exon junction sequences usually do not occur in genomic DNA or unprocessed RNA.
