Hello all. I have searched the threads for help to address my specific problem, but I can only find responses that come close to answering specific questions. Please be patient if my questions appear to be similar to previous posts.
SCENARIO: I used the Tophat-Cufflinks-Cuffdiff pipeline in Galaxy to generate data for FPKM values and fold changes for 23,245 human genes. I have two equally-drug-treated, independently RNA-seq'ed samples (T1, T2) and two untreated samples (U1, U2). I am interested in quickly determining how reproducible the results between T1 and T2 may be for each gene of the 23,245 total.
APPROACH: I used Excel to make a volcano plot. My naive approach was for gene#1, use FPKM values T1:T2 (A1, B1) for the first array, and use FPKM values U1:U2 (C1, D1) for the second array, use a two-tailed t-test, and assume equal variance: =T.TEST(A1:B1,C1:D1,2,2). I ran this calculation separately for each of the 23,245 genes (using fill-down in column D). Then I plotted log(average-fold-change-of-the-two-experiments,2) vs. -log(p-value,10) for the 23,245 genes. The resulting plot looks reasonable enough.
QUESTION: Is a T-test like this appropriate for identifying genes that have 'large' (very high and very low) fold changes consistently across two experiments? Is this publishable as long as I explain what I did? If not, what is a good approach for determining genes that appear to consistently respond to the treatment? The RNA-seq will not stand alone, it is for screening purposes...candidates will be tested by qRT-PCR time-course experiments.
Thank you for your time.
Thank you for your reply. Does DEseq require starting all over again from raw data? Can any of the options you listed take cuffdiff output as input data? I ask because I spent hours deleting and changing out-dated gene annotations from my cuffdiff output. Apparently the most recent hg19 gtf file is full of out-dated gene ID's.
I am afraid you'll have to start all over from bam files... You'll have to count the reads from bam files using something like featureCount (from subRead), and then import them to limma or DEseq.
But, if you want the newest genome annotation you could use hg38 instead of hg19, but then you'll have to align your fastq files (go back even rawer).
Just for clarity of understanding, edgeR and DEseq are also parametric tests. They just use a different distribution and share information between genes to estimate variance.
In general, when I tested this using simulation data (http://www.ncbi.nlm.nih.gov/pubmed/23314327 ), a t-test usually calculates accurate p-values and gives accurate results. However, the problem with a t-test is that it will be underpowered with only two replicates because it relies on the observed variance and the variance measured with two replicates is very uncertain. It works just fine with more reps.
edgeR, DEseq, and limma look at the average variance across genes to get a better estimate of what the true variance is for each individual gene. I believe limma is non-parametric. They will usually introduce some bias into what genes are called (i.e. higher variance genes will be more likely to be called) but it's not usually enough to screw up your results. But expect to get a different dataset with your new analysis. Neither dataset is wrong - they are just identifying different subsets of your "true" deferentially expressed genes.