Question

T-test for RNA-seq FPKM values in Excel

0

Entering edit mode

9.3 years ago

khaynes ▴ 50

Hello all. I have searched the threads for help to address my specific problem, but I can only find responses that come close to answering specific questions. Please be patient if my questions appear to be similar to previous posts.

SCENARIO: I used the Tophat-Cufflinks-Cuffdiff pipeline in Galaxy to generate data for FPKM values and fold changes for 23,245 human genes. I have two equally-drug-treated, independently RNA-seq'ed samples (T1, T2) and two untreated samples (U1, U2). I am interested in quickly determining how reproducible the results between T1 and T2 may be for each gene of the 23,245 total.

APPROACH: I used Excel to make a volcano plot. My naive approach was for gene#1, use FPKM values T1:T2 (A1, B1) for the first array, and use FPKM values U1:U2 (C1, D1) for the second array, use a two-tailed t-test, and assume equal variance: =T.TEST(A1:B1,C1:D1,2,2). I ran this calculation separately for each of the 23,245 genes (using fill-down in column D). Then I plotted log(average-fold-change-of-the-two-experiments,2) vs. -log(p-value,10) for the 23,245 genes. The resulting plot looks reasonable enough.

QUESTION: Is a T-test like this appropriate for identifying genes that have 'large' (very high and very low) fold changes consistently across two experiments? Is this publishable as long as I explain what I did? If not, what is a good approach for determining genes that appear to consistently respond to the treatment? The RNA-seq will not stand alone, it is for screening purposes...candidates will be tested by qRT-PCR time-course experiments.

Thank you for your time.

excel rna-seq statistics • 8.9k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.3 years ago by khaynes ▴ 50

Ram · Answer 1 · 2016-01-21

3

Entering edit mode

9.3 years ago

Benn 8.4k

Hi,

Best way to handle RNAseq data for differential expression analysis is limma-voom, edgeR or DEseq, etc. Parametric test, like your t-test seem not appropriate, you'll need eBayes estimation of your variance.

There are a lot of post about limma, edgeR, or DEseq. Also very good manuals and userguides are provided by the packages.

Last point, I don't think excel is doing a correction for multiple testing? Something you should do when performing 23k tests.

Good luck!

ADD COMMENT • link 9.3 years ago by Benn 8.4k

0

Entering edit mode

Thank you for your reply. Does DEseq require starting all over again from raw data? Can any of the options you listed take cuffdiff output as input data? I ask because I spent hours deleting and changing out-dated gene annotations from my cuffdiff output. Apparently the most recent hg19 gtf file is full of out-dated gene ID's.

ADD REPLY • link 9.3 years ago by khaynes ▴ 50

0

Entering edit mode

I am afraid you'll have to start all over from bam files... You'll have to count the reads from bam files using something like featureCount (from subRead), and then import them to limma or DEseq.

But, if you want the newest genome annotation you could use hg38 instead of hg19, but then you'll have to align your fastq files (go back even rawer).

ADD REPLY • link 9.3 years ago by Benn 8.4k

0

Entering edit mode

Just for clarity of understanding, edgeR and DEseq are also parametric tests. They just use a different distribution and share information between genes to estimate variance.

In general, when I tested this using simulation data (http://www.ncbi.nlm.nih.gov/pubmed/23314327 ), a t-test usually calculates accurate p-values and gives accurate results. However, the problem with a t-test is that it will be underpowered with only two replicates because it relies on the observed variance and the variance measured with two replicates is very uncertain. It works just fine with more reps.

edgeR, DEseq, and limma look at the average variance across genes to get a better estimate of what the true variance is for each individual gene. I believe limma is non-parametric. They will usually introduce some bias into what genes are called (i.e. higher variance genes will be more likely to be called) but it's not usually enough to screw up your results. But expect to get a different dataset with your new analysis. Neither dataset is wrong - they are just identifying different subsets of your "true" deferentially expressed genes.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by Michele Busby ★ 2.2k

Ram · Answer 2 · 2016-01-21

1

Entering edit mode

9.3 years ago

khaynes ▴ 50

Hello all. I tried cuffdiff in Galaxy again, and figured out the right settings to use when the two samples have multiple replicates. My volcano plot looks much better. The p-values may not mean a lot, still, but the data now agree with some RT-PCR data I did under the same conditions. Thank you for your input!

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by khaynes ▴ 50