Comparing FPKM values in different genes
6.2 years ago
snp87 ▴ 80

Hi all,

I am new to RNA-seq analysis and I wanted to ask about comparing expression in different genes. I apologise in advance that this question is quite basic. I used Cuffdiff to perform a differential expression analysis and validated several of the candidate genes it identified using in situ hybridisation. While the validation of most genes confirmed the expected differential expression, it was difficult to decide on the cut-off below which no expression could be expected. For instance, gene A had an FPKM of 1000 in cell1 and an FPKM of 90 in cell2; validation showed expression in cell1 but not cell2. However, gene B had an FPKM of 80 in cell1 and an FPKM of 3 in cell2, and validation showed expression in both cell1 and cell2, though it was stronger in cell1.

Since FPKM is normalised for gene length, I assumed that the FPKM values of different genes should be comparable. Am I wrong in thinking this? And, other than the sensitivity of the probe in detecting the gene, could there be any reason that in situ hybridisation of some genes shows no expression when there are transcripts according to the transcriptomic data?
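For reference, FPKM is conventionally the fragment count divided by the gene length in kilobases and by the total mapped fragments in millions. Here is a minimal Python sketch of that formula. The `fpkm` helper and all numbers are made up for illustration and are not from the data in this thread.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    length_kb = gene_length_bp / 1_000
    depth_millions = total_mapped_fragments / 1_000_000
    return fragments / (length_kb * depth_millions)

# Illustrative numbers only: two genes in the same 20-million-fragment library.
short_gene = fpkm(fragments=500, gene_length_bp=1_000, total_mapped_fragments=20_000_000)
long_gene = fpkm(fragments=2_000, gene_length_bp=4_000, total_mapped_fragments=20_000_000)
print(short_gene, long_gene)
```

Both genes come out at the same FPKM even though the longer one has four times as many fragments, which is the sense in which FPKM is length-normalised within a single sample.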

Thanks so much!

RNA-Seq validation of genes • 7.9k views

Thanks so much for your reply, Kevin. You make some valid points. Just a few clarifications, though. My purpose for the RNA-seq was to perform a differential expression analysis between two groups of closely related cells. While that was the main aim, now that I have the dataset, I want to see which method might be best for predicting whether a gene is expressed or not in the transcriptome. I also generated HTSeq counts and analysed the data using DESeq2, and I noticed the same issues with that pipeline. Since the count data does not take into account the gene length or the sequencing depth, which differed between the samples, I thought it was easier to compare the expression of different genes in the same sample based on FPKM (but I guess that, given the issues with the normalisation used to calculate FPKM, this is not accurate).

Relating my question to the count matrix generated by HTSeq and the DESeq2 analysis, how can you decide what number of counts can be assumed to be negligible expression (not biologically relevant)? Two genes, X and Y, are each approximately 2000 bp long. Gene X had counts of 5-54 in one sample (with 5 replicates) and Gene Y had counts of 20-100 in the same sample (with 5 replicates). Gene X was expressed when validated with in situ hybridisation while Gene Y was not. Do you think this is more related to the sensitivity of RNA-seq vs in situ hybridisation for detecting the genes, or does it point to a problem with my data set? The replicates were multiplexed and sequenced in the same run, but each replicate had a different sequencing depth (which I've read is quite common). Also, just to mention, I am working with low RNA quantities (RIN > 8): 2 ng (2000 pg) of RNA was used for cDNA synthesis and amplification and subsequent library synthesis.

Thanks so much!


I see, your aim is to literally just determine expressed and non-expressed. Given that you have FPKM, you could transform the data to the Z scale using the zFPKM function in R, which has actually been received positively, from what I have seen so far. Going by the Z-scale, you will then have a more intuitive way of gauging expressed / non-expressed because it may then be as simple as:

  • Z-score = 0: expressed
  • Z-score > 3: highly expressed
  • Z-score < -3: not expressed
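To make the idea concrete, here is a rough Python sketch of a z-scaling of log2(FPKM) followed by the cut-offs above. It is a simplified stand-in and not the zFPKM package's implementation. The helper names `zfpkm_like` and `classify`, the coarse histogram used to find the mode, and the toy FPKM values are all assumptions made for illustration.

```python
import math
from collections import Counter

def zfpkm_like(fpkm, bin_width=0.5):
    """Z-scale log2(FPKM) per gene; a simplified stand-in, not the zFPKM package.

    Assumption: the mode is taken from a coarse histogram, and values above
    the mode are treated as the right half of a Gaussian.
    """
    logs = {g: math.log2(v) for g, v in fpkm.items() if v > 0}
    # Crude mode estimate: centre of the fullest bin (ties go to the higher bin).
    bins = Counter(math.floor(x / bin_width) for x in logs.values())
    top_bin = max(bins, key=lambda b: (bins[b], b))
    mu = (top_bin + 0.5) * bin_width
    upper = [x for x in logs.values() if x > mu]
    if not upper:
        raise ValueError("no values above the estimated mode")
    # For a half-normal, sigma = mean distance from the centre * sqrt(pi / 2).
    sd = (sum(upper) / len(upper) - mu) * math.sqrt(math.pi / 2)
    # Genes with FPKM of 0 have no log2 value; place them at -infinity.
    return {g: (logs[g] - mu) / sd if g in logs else float("-inf") for g in fpkm}

def classify(z, low=-3.0, high=3.0):
    """Apply the cut-offs listed above."""
    if z < low:
        return "not expressed"
    if z > high:
        return "highly expressed"
    return "expressed"

# Hypothetical gene names and FPKM values.
toy = {"g1": 8, "g2": 8, "g3": 8, "g4": 16, "g5": 32, "g6": 0}
z = zfpkm_like(toy)
labels = {g: classify(s) for g, s in z.items()}
print(labels)
```

With real data you would run this per sample over all genes. The toy input is far too small for the mode estimate to mean anything and is only there to show the mechanics.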

Coincidentally, it is through this logic that some have been developing cellular deconvolution methods from RNA-seq data.


Thanks so much for the suggestion - I will try that


Hi Kevin,

I have a large RNA-seq dataset with FPKM values. As you said, I can't use it for differential gene expression profiling, but what is the best way to see which genes are highly or lowly expressed?

Maybe a linear regression? At least to get an approximation.

thanks


Hey Morris, I would explore transforming these values via the zFPKM package in R. It is not the end of the world if you just have FPKM, though. What is your study? Disease versus control?


From the RNA-seq analysis of my experiment, I found about 400 genes that are more highly expressed. I downloaded a large RNA-seq dataset (FPKM) from patients with cancer, and I want to see which of my 400 genes are more highly expressed in these patients.

But I'm still not sure how to split low- versus high-expressed genes in the FPKM matrix.

Thanks


So, you downloaded the TCGA data? :) If you just want to determine expressed | not expressed, then having FPKM is not so bad.

With FPKM, just use 10 as the lower threshold, i.e., anything that passes 10 is expressed at some level. The upper threshold is more difficult to determine with FPKM...

If you transform to the Z-scale via the zFPKM package (recommended), then use <= -3 as not expressed, and >= +3 as highly expressed.

Ultimately, there is no correct or incorrect answer.
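As a rough illustration of the FPKM cut-off, the Python sketch below reports, for each gene of interest, the fraction of samples in which it passes the threshold of 10. The `fraction_expressed` helper, the gene names and the values are hypothetical, and the threshold is only the rule of thumb mentioned above.

```python
def fraction_expressed(fpkm_by_gene, threshold=10.0):
    """For each gene, the fraction of samples whose FPKM exceeds `threshold`."""
    return {
        gene: sum(v > threshold for v in values) / len(values)
        for gene, values in fpkm_by_gene.items()
    }

# Hypothetical gene names and FPKM values across four samples.
cohort = {
    "GENE_A": [55.0, 12.5, 80.1, 9.0],
    "GENE_B": [0.0, 3.2, 1.1, 0.4],
}
fractions = fraction_expressed(cohort)
print(fractions)
```

Here GENE_A passes the cut-off in three of four samples and GENE_B in none. Ranking a gene list by this fraction is one simple way to summarise a large FPKM matrix.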


Yes, I downloaded TCGA data. Why is 10 considered the threshold?

Thank you : )


10 is just a nice number because we use the decimal system of numbers. If we had only 8 fingers (4 on each hand), then maybe I would choose 8 as the cutoff. 10 is also the number mentioned by my colleague, who [in part] developed zFPKM.

6.2 years ago

Hey,

RPKM / FPKM are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.
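One widely used form of between-sample normalisation is the median-of-ratios scaling that DESeq2 applies to raw counts. Below is a minimal Python sketch of the idea. It is not DESeq2's own code, and the `size_factors` helper and the toy counts are assumptions for illustration.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: {gene: [count in sample 0, sample 1, ...]} -> one factor per sample."""
    n_samples = len(next(iter(counts.values())))
    # Use only genes with non-zero counts everywhere so the geometric mean is defined.
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples (a pseudo-reference sample).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Median, over genes, of the sample's ratio to the pseudo-reference.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy counts: sample 1 was sequenced at exactly twice the depth of sample 0.
toy_counts = {"g1": [10, 20], "g2": [100, 200], "g3": [50, 100], "g4": [0, 5]}
factors = size_factors(toy_counts)
normalised = {g: [c / f for c, f in zip(row, factors)] for g, row in toy_counts.items()}
print(factors)
```

Dividing each sample's counts by its factor puts the replicates on a common scale, so the two-fold depth difference in the toy data disappears for g1 to g3.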

Cuffdiff is also old. HISAT2 / StringTie are the upgrades to the older TopHat / Cufflinks pipeline.

------------------------------------------

For instance, gene A had an FPKM of 1000 in cell1 and an FPKM of 90 in cell2; validation showed expression in cell1 but not cell2. However, gene B had an FPKM of 80 in cell1 and an FPKM of 3 in cell2, and validation showed expression in both cell1 and cell2, though it was stronger in cell1.

This is exactly the consequence of the normalisation process that produces FPKM values: a value of 80 in one sample may mean something entirely different from 80 in another sample because each sample is scaled by its own total. In extreme cases, 80 could mean very high expression in one sample but virtually nil in the other. However, the statistical tests cannot make this distinction. This also has a direct consequence when setting minimal thresholds, i.e., for expressed / not expressed.
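One way this can arise is a composition effect. Because FPKM is scaled by each sample's own total, a gene with the same number of transcripts can get a very different FPKM if some other gene dominates the library. Below is a toy Python model of that. The `expected_fpkm` helper, the gene names and the molecule numbers are invented, and the model assumes fragments are sampled in proportion to transcript number times length.

```python
def expected_fpkm(molecules, lengths_bp):
    """Expected FPKM when fragments are sampled in proportion to molecules x length.

    Sequencing depth cancels out, so only each gene's share of the library matters.
    """
    mass = {g: molecules[g] * lengths_bp[g] for g in molecules}
    total = sum(mass.values())
    return {g: (mass[g] / total) * 1e6 / (lengths_bp[g] / 1e3) for g in molecules}

# Toy model: every gene is 2000 bp; gene B has 100 transcripts in both samples.
lengths = {"B": 2_000, "rest": 2_000, "dominant": 2_000}
sample1 = {"B": 100, "rest": 9_900, "dominant": 0}
sample2 = {"B": 100, "rest": 9_900, "dominant": 90_000}

fpkm1 = expected_fpkm(sample1, lengths)
fpkm2 = expected_fpkm(sample2, lengths)
print(fpkm1["B"], fpkm2["B"])
```

Gene B's FPKM drops about ten-fold in the second sample although its transcript number is unchanged, purely because the dominant gene takes up most of the library.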

Unfortunately, FPKM data is still widely used and appears in publications, which is an argument always used to defend its usage by those who are unaware of its pitfalls.

----------------------------------------------

If you have RNA-seq data, then please use a better tool for differential expression analysis, like DESeq2, edgeR, or limma-voom.

Kevin
