Hi to all community!!
First of all thank everyone that helped me!
Have a question: I have analyzed RNAseq data using TopHat, then CuffLinks and finally parsing results to R and analyzing via Limma Package. I obtained following the tutorial the TopTable with the: Name of the Gene, log2FC, Average Expression, adjusted p value, among others (t, B,...); and also computed the FPKM and FC and added to the table.
Here you can find a preview of my data analyzed:
My objective at this point is to retrieve the full list of genes (of nearly 23,000) that appear to be UP or DOWN according to fold change.
The filtering that usually I do is to set:
- adj.P.Val < 0.05 (FDR)
- FC > 1.4 (UP) or < -1.4 (DOWN)
But only applying this criteria I reach too many genes UP or DOWN (3000 up and 4000 down). Considering previous experiments in our lab I expect to find close to 2000 up/down genes.
My question is about the FPKM and if I should use it to be more stringent in my filters: do you have experience with FPKM filtering ? which threshold do you use in your experiments? What can you recommend me?
I read some literature about this and found some papers which use an FPKM > 1 together with FDR and FC, and others not use the same FPKM or even use the RPKM > 0.1 and I don't know exactly how to proceed to set my filters with more statistically significant limits for genes UP/DOWN.
Thanks to all community! Any help will be kindly appreciated! Dani.
There is no gold standard for thresholding expression. The best you can do is to try to threshold based on biological observations. One method might be to look at the expression of introns/intergenic regions that you know probably should not be expressing anything and use that as a baseline. However, even that method can quickly get complicated.
I just want to throw a caution here about your analysis: it's not clear to me how you have used the output of cufflinks to send data to R and limma. If you are using limma, hopefully you are using the
voom
function, which would require a table of read counts per gene (per expt) as input, and not FPKM values.Is that what you are using?
Hi Steve,
Yes you are right :) But My question is what to do to be stringent having FC, FPKM and Ajusted P value (FDR) to filter genes UP-regulated and DOWN-regulated.
Thanks for your comment!
To continue Steve's point, what exactly do you mean when you say
then Cufflinks and finally parse...
. How or what is Cufflinks' output doing in limma? Could you clarify?That is,
I have two timepoints: 0h and 6h of treatment.
So to continue analysis after CuffLinks the genes.fpkm.tracking files from both treatments that the program outputs was taken by the FPKM, FPKM.high and FPKM.low together with gene symbol always where FPKM status was OK.
Then joined files by Gene Symbol to obtain a file with the following columns:
Gene_symbol, FPKM(0h), FPKM.high(0h), FPKM.low(0h),FPKM(6h), FPKM.high(6h), FPKM.low(6h).
This file was used to be analyzed with Limma and the result is the Table with computations of log2FC, Average Expressions, FDR, etc...
Sorry for the bad explanation, I hope this will be clear enough :)
Thanks, I think I get it. So input to
limma
was FPKM. Right?Yes,
The input was all FPKMs in both conditions
Okay, before continuing with the original question you'd asked, the method of providing input as FPKM to limma, a package that assumes count values (and the statistical tests are designed for count data), is wrong. This is what Steve was trying to explain. Since your statistical analysis is invalid, it is difficult to tell anything about the 3000-4000 genes you have obtained. So, I suggest you re-run limma with read counts to find the differentially expressed genes correctly.
To nitpick a bit, limma does not actually assume count values - it was developed for microarray analysis, after all. What voom() is doing is preparing count data for limma by stabilizing the variance (making sure that the variance does not depend to much on the mean). If the FPKM data have the proper distribution limma should work fine, but they probably don't, so it may be safer to use counts -> voom().
Yes, thanks for pointing it out Mikael. I've been using this and edgeR exchangeably that much that I forgot the difference! :)
That's what I thought :-) edgeR and limma are very similar in some ways
Ok I understand the problem,
Since I'm new in RNA-seq, how I could obtain read counts from FPKM?, I was searching in the web and the solution is not trivial and I couldn't find any clear solution
Thanks
You can not. You'll have to obtain read counts from your sam/bam mapped file directly. As for the thresholding, people also increase the cut-off of the log2FC to >=2. Note: This is to be done after you get your analysis right.