Question

Rnaseq Processed Data To Genes Up/Down

0

Entering edit mode

12.0 years ago

daniel.soronellas ▴ 330

Hi to all community!!

First of all thank everyone that helped me!

Have a question: I have analyzed RNAseq data using TopHat, then CuffLinks and finally parsing results to R and analyzing via Limma Package. I obtained following the tutorial the TopTable with the: Name of the Gene, log2FC, Average Expression, adjusted p value, among others (t, B,...); and also computed the FPKM and FC and added to the table.

Here you can find a preview of my data analyzed:

Here you can find a preview:

My objective at this point is to retrieve the full list of genes (of nearly 23,000) that appear to be UP or DOWN according to fold change.

The filtering that usually I do is to set:

adj.P.Val < 0.05 (FDR)
FC > 1.4 (UP) or < -1.4 (DOWN)

But only applying this criteria I reach too many genes UP or DOWN (3000 up and 4000 down). Considering previous experiments in our lab I expect to find close to 2000 up/down genes.

My question is about the FPKM and if I should use it to be more stringent in my filters: do you have experience with FPKM filtering ? which threshold do you use in your experiments? What can you recommend me?

I read some literature about this and found some papers which use an FPKM > 1 together with FDR and FC, and others not use the same FPKM or even use the RPKM > 0.1 and I don't know exactly how to proceed to set my filters with more statistically significant limits for genes UP/DOWN.

Thanks to all community! Any help will be kindly appreciated! Dani.

fpkm • 7.8k views

ADD COMMENT • link updated 12.0 years ago by biorepine ★ 1.5k • written 12.0 years ago by daniel.soronellas ▴ 330

2

Entering edit mode

There is no gold standard for thresholding expression. The best you can do is to try to threshold based on biological observations. One method might be to look at the expression of introns/intergenic regions that you know probably should not be expressing anything and use that as a baseline. However, even that method can quickly get complicated.

ADD REPLY • link 12.0 years ago by Damian Kao 16k

1

Entering edit mode

I just want to throw a caution here about your analysis: it's not clear to me how you have used the output of cufflinks to send data to R and limma. If you are using limma, hopefully you are using the voom function, which would require a table of read counts per gene (per expt) as input, and not FPKM values.

Is that what you are using?

ADD REPLY • link 12.0 years ago by Steve Lianoglou 5.2k

0

Entering edit mode

Hi Steve,

Yes you are right :) But My question is what to do to be stringent having FC, FPKM and Ajusted P value (FDR) to filter genes UP-regulated and DOWN-regulated.

Thanks for your comment!

ADD REPLY • link 12.0 years ago by daniel.soronellas ▴ 330

0

Entering edit mode

To continue Steve's point, what exactly do you mean when you say then Cufflinks and finally parse.... How or what is Cufflinks' output doing in limma? Could you clarify?

ADD REPLY • link 12.0 years ago by Arun 2.4k

0

Entering edit mode

That is,

I have two timepoints: 0h and 6h of treatment.

So to continue analysis after CuffLinks the genes.fpkm.tracking files from both treatments that the program outputs was taken by the FPKM, FPKM.high and FPKM.low together with gene symbol always where FPKM status was OK.

Then joined files by Gene Symbol to obtain a file with the following columns:

Gene_symbol, FPKM(0h), FPKM.high(0h), FPKM.low(0h),FPKM(6h), FPKM.high(6h), FPKM.low(6h).

This file was used to be analyzed with Limma and the result is the Table with computations of log2FC, Average Expressions, FDR, etc...

Sorry for the bad explanation, I hope this will be clear enough :)

ADD REPLY • link 12.0 years ago by daniel.soronellas ▴ 330

0

Entering edit mode

Thanks, I think I get it. So input to limma was FPKM. Right?

ADD REPLY • link 12.0 years ago by Arun 2.4k

0

Entering edit mode

Yes,

The input was all FPKMs in both conditions

ADD REPLY • link 12.0 years ago by daniel.soronellas ▴ 330

1

Entering edit mode

Okay, before continuing with the original question you'd asked, the method of providing input as FPKM to limma, a package that assumes count values (and the statistical tests are designed for count data), is wrong. This is what Steve was trying to explain. Since your statistical analysis is invalid, it is difficult to tell anything about the 3000-4000 genes you have obtained. So, I suggest you re-run limma with read counts to find the differentially expressed genes correctly.

ADD REPLY • link 12.0 years ago by Arun 2.4k

3

Entering edit mode

To nitpick a bit, limma does not actually assume count values - it was developed for microarray analysis, after all. What voom() is doing is preparing count data for limma by stabilizing the variance (making sure that the variance does not depend to much on the mean). If the FPKM data have the proper distribution limma should work fine, but they probably don't, so it may be safer to use counts -> voom().

ADD REPLY • link 12.0 years ago by Mikael Huss 4.8k

0

Entering edit mode

Yes, thanks for pointing it out Mikael. I've been using this and edgeR exchangeably that much that I forgot the difference! :)

ADD REPLY • link 12.0 years ago by Arun 2.4k

0

Entering edit mode

That's what I thought :-) edgeR and limma are very similar in some ways

ADD REPLY • link 12.0 years ago by Mikael Huss 4.8k

0

Entering edit mode

Ok I understand the problem,

Since I'm new in RNA-seq, how I could obtain read counts from FPKM?, I was searching in the web and the solution is not trivial and I couldn't find any clear solution

Thanks

ADD REPLY • link 12.0 years ago by daniel.soronellas ▴ 330

1

Entering edit mode

You can not. You'll have to obtain read counts from your sam/bam mapped file directly. As for the thresholding, people also increase the cut-off of the log2FC to >=2. Note: This is to be done after you get your analysis right.

ADD REPLY • link 12.0 years ago by Arun 2.4k

score 2 · Answer 1 · 2012-12-26

2

Entering edit mode

12.0 years ago

biorepine ★ 1.5k

Forget about these microarray R packages, when you using cufflinks output. I strongly suggest you to use cuffdiff to calculate differential expression of two time points. If you want to use R for further downstream analysis, apply cummerbund.

ADD COMMENT • link 12.0 years ago by biorepine ★ 1.5k