Hi all,
I am new to RNA-seq analysis and I wanted to ask about comparing expression between different genes. I apologise in advance that this question is quite basic. I used Cuffdiff to perform a differential expression analysis and validated several of the identified candidate genes using in situ hybridisation. While the validation of most genes confirmed the expected differential expression, it was difficult to decide on a cut-off below which no expression would be expected. For instance, gene A had an FPKM of 1000 in cell1 and an FPKM of 90 in cell2, and validation showed expression in cell1 but not in cell2. However, gene B had an FPKM of 80 in cell1 and an FPKM of 3 in cell2, and validation showed expression in both cell1 and cell2, though it was stronger in cell1.
Since FPKM is normalised for gene length, I assumed that the FPKM values of different genes should be comparable. Am I wrong in thinking this? And, other than the sensitivity of the probe in detecting the gene, could there be any reason why in situ hybridisation of some genes shows no expression when there are transcripts according to the transcriptomic data?
Thanks so much!
Thanks so much for your reply, Kevin. You make some valid points. Just a few clarifications, though. My purpose for the RNA-seq was to perform a differential expression analysis between 2 groups of closely related cells. While that was the main aim, now that I have the dataset, I want to see what method might be best to predict whether a gene is expressed or not in the transcriptome. I also produced HTSeq counts and analysed the data using DESeq2, and I noticed the same issues with this pipeline. Since the count data do not take into consideration the gene length or the sequencing depth, which differed between the samples, I thought it was easier to compare the expression of different genes in the same sample based on FPKM (but I guess that, given the issues with the normalisation used to calculate FPKM, this is not accurate).
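For reference, the length and depth scaling mentioned above can be sketched in a few lines of Python. This is only an illustration of the textbook FPKM definition, not what Cuffdiff does internally (Cufflinks-family tools estimate abundances with their own model). The function name `fpkm` and all the numbers in the example are hypothetical.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Textbook definition only; all inputs are assumed to be plain numbers
    supplied by the caller (hypothetical helper, not a library function).
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million fragments)
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical example: the same 2000 bp gene in two libraries of different depth.
deep = fpkm(54, 2000, 20_000_000)     # about 1.35
shallow = fpkm(27, 2000, 10_000_000)  # about 1.35 as well
print(deep, shallow)
```

Half the raw count in a library half as deep gives the same FPKM, which is why raw counts alone are hard to compare between replicates sequenced to different depths.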
Relating my question to the count matrix generated by HTSeq and the DESeq2 analysis, how can you decide what number of counts can be assumed to represent negligible expression (not biologically relevant)? Two genes, X and Y, are each approximately 2000 bp long. Gene X had counts of 5-54 in one sample (with 5 replicates) and Gene Y had counts of 20-100 in the same sample (with 5 replicates). Gene X was expressed when validated with in situ hybridisation while Gene Y was not expressed - do you think this is more related to the sensitivity of RNA-seq vs in situ hybridisation for detecting the genes, or does it point to a problem with my data set? The replicates were multiplexed and sequenced in the same run, but each replicate had a different sequencing depth (which I've read is quite common). Also, just to mention, I am working with low quantities of RNA (RIN > 8) - 2 ng (2000 pg) of RNA was used for cDNA synthesis and amplification and subsequent library synthesis.
Thanks so much!
I see, your aim is to literally just determine expressed and non-expressed. Given that you have FPKM, you could transform the data to the Z scale using the zFPKM function in R, which has actually been received positively, from what I have seen so far. Going by the Z-scale, you will then have a more intuitive way of gauging expressed / non-expressed because it may then be as simple as:
Coincidentally, it is through this logic that some have been developing cellular deconvolution methods from RNA-seq data.
Thanks so much for the suggestion - I will try that
Hi Kevin,
I have a large RNA-seq dataset with FPKM values. As you said, I can't use it for differential gene expression profiling, but what is the best way to see which genes are highly or lowly expressed?
Maybe a linear regression? At least to get an approximation.
Thanks
Hey Morris, I would explore the transformation of these FPKM values via the zFPKM package in R. It is not the end of the world if you just have FPKM, though. What is your study? - disease versus control?
From the RNA-seq analysis of my experiment, I found about 400 genes that are more highly expressed. I downloaded a large RNA-seq dataset (FPKM) from patients with cancer, and I want to see which of my 400 genes are more highly expressed in these patients.
But I'm still not sure how to split low versus high expressed genes in the FPKM matrix.
Thanks
So, you downloaded the TCGA data? :) If you just want to determine expressed | not expressed, then having FPKM is not so bad. With FPKM, just use 10 as the lower threshold, i.e., anything that passes 10 is expressed at some level. The upper threshold is more difficult to determine with FPKM...
If you transform to Z-scale via the zFPKM package (recommended), then use <= -3 as not expressed, and >= +3 as highly expressed
Ultimately, there is no correct or incorrect answer.
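To make these cut-offs concrete, here is a rough Python sketch. It is not the zFPKM R package: `zfpkm_like` is a simplified, hypothetical re-implementation of the general idea as I understand it (fit a Gaussian to the right-hand side of the log2(FPKM) distribution, then express each gene as a Z-score), and the kernel bandwidth rule, the grid size and the handling of zero FPKM are all my own assumptions. Only the thresholds (10, -3 and +3) come from this thread.

```python
import math
import statistics


def zfpkm_like(fpkm_values, grid_points=512):
    """Simplified zFPKM-style transform (assumption-laden sketch).

    Returns (z_scores, mu, sd); genes with FPKM <= 0 get None.
    Needs at least two distinct positive FPKM values.
    """
    logs = [math.log2(v) for v in fpkm_values if v > 0]
    # Normal-reference rule-of-thumb bandwidth (an assumption of this sketch).
    bandwidth = 1.06 * statistics.stdev(logs) * len(logs) ** (-0.2)
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]

    def density(x):
        # Unnormalised Gaussian kernel density; enough to locate the peak.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in logs)

    mu = max(grid, key=density)  # peak of the density, taken as the mean
    upper = [v for v in logs if v > mu]
    # For a half-normal, mean distance from the centre = sd * sqrt(2 / pi).
    sd = (statistics.mean(upper) - mu) * math.sqrt(math.pi / 2)
    z = [(math.log2(v) - mu) / sd if v > 0 else None for v in fpkm_values]
    return z, mu, sd


def classify_z(z, low=-3.0, high=3.0):
    """Apply the Z-scale cut-offs suggested in this thread."""
    if z is None or z <= low:  # treating FPKM of 0 as not expressed (assumption)
        return "not expressed"
    if z >= high:
        return "highly expressed"
    return "expressed"


def expressed_by_fpkm(fpkm, threshold=10.0):
    """Raw-FPKM rule from this thread: anything above 10 counts as expressed."""
    return fpkm > threshold
```

With a list of per-gene FPKM values for one sample, `z, mu, sd = zfpkm_like(values)` followed by `[classify_z(v) for v in z]` gives one label per gene. The cut-offs remain conventions, so it is worth checking them against any validation data you have.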
Yes, I downloaded TCGA data. Why is 10 considered the threshold?
Thank you : )
10 is just a nice number because we use the decimal system of numbers. If we had only 8 fingers (4 on each hand), then maybe I would choose 8 as the cutoff. 10 is also the number mentioned by my colleague, who [in part] developed zFPKM.