Hi,
I am working with RNA-seq data and studying whether the expression of a set of genes is subtype-specific. I have plotted the expression of these genes across the different subtypes, but to check whether the differences are significant, I want to perform a t-test on the samples wherever I see a difference between two molecular subtypes.
I performed the t-test with the t.test() function in R and I have two questions regarding the output:
First, at times it gives me an exact p-value, for example 3.921e-14, but at other times it just says p-value < 2.2e-16. Is that the smallest p-value it can display, or can I get the exact value?
Secondly, the confidence level is 0.95 by default, which I changed to 0.99, 0.999 and so on, yet the p-value never changes. To confirm this, I also tried confidence levels of 0.5 and 0.1, again with no change.
Any help or advice on the two points would be greatly appreciated.
Thanks.
Look up "edgeR" or "DESeq2" for differential gene expression testing on RNA-seq data. A t-test is not appropriate in this situation because of the way the data are distributed (you also probably need to normalize for sequencing depth between samples).
You are correct that the p-value of a test is not a function of the confidence level. The conf.level argument only changes the width of the reported confidence interval.
Google "2.2e-16".
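To make both points concrete, here is a minimal sketch in base R. The data are simulated purely for illustration (group_a and group_b are made-up names, not the poster's samples). It shows that the printed "< 2.2e-16" is only a display cutoff equal to the machine epsilon for doubles, that the unrounded value is stored in the result object, and that changing conf.level moves the interval but not the p-value.

```r
# Simulated toy data (an assumption for illustration): two groups with clearly different means.
set.seed(1)
group_a <- rnorm(50, mean = 0)
group_b <- rnorm(50, mean = 3)

res_95 <- t.test(group_a, group_b)                     # default conf.level = 0.95
res_99 <- t.test(group_a, group_b, conf.level = 0.99)  # wider interval, same test

# The printed summary shows "p-value < 2.2e-16", but the stored value is not cut off.
res_95$p.value

# The display cutoff is the machine epsilon for double-precision numbers.
.Machine$double.eps

# conf.level changes the interval width only; the p-value is identical.
same_p   <- identical(res_95$p.value, res_99$p.value)
width_95 <- diff(res_95$conf.int)
width_99 <- diff(res_99$conf.int)
```

Values that small should be read as "effectively zero" rather than as precise numbers, since they depend heavily on the test's assumptions holding in the far tails.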
Thanks for the advice on edgeR and DESeq2. I will surely look into them, but why is a t-test not appropriate in this case?
Look here: http://seqanswers.com/forums/showthread.php?p=161824
In short, every test functions under certain assumptions. Breaking those assumptions breaks the test. RNA-seq data breaks the assumption of the t-test that the data is drawn from a normal distribution. In practice, you'd fail to detect true differences and may "detect" false ones too.
Imagine a data set with a clump of points in one corner and an outlier in the other. A t-test, assuming the data are normally distributed, will estimate the mean to lie somewhere between the outlier and the clump, completely misrepresenting the true distribution (which is likely centered around the clump). A t-test is really comparing this estimated distribution to another estimated distribution, so if the estimate is faulty, the result of the test will be faulty too.
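The clump-plus-outlier situation can be sketched with a few made-up numbers (subtype_1 and subtype_2 are hypothetical values chosen for illustration, not real expression data). The clumps of the two groups are clearly different, yet the single outlier drags the mean away and inflates the variance enough that the t-test no longer flags the difference.

```r
# Made-up expression values: five samples clumped near 10, plus one outlier at 500.
subtype_1 <- c(9, 10, 10, 11, 12, 500)
# A second made-up group clumped near 20, with no outlier.
subtype_2 <- c(20, 21, 19, 22, 20, 21)

mean(subtype_1)    # 92: pulled far away from the clump by the outlier
median(subtype_1)  # 10.5: stays with the clump

# With the outlier, the variance estimate is so inflated that the test sees nothing.
p_with_outlier <- t.test(subtype_1, subtype_2)$p.value

# Without the outlier, the difference between the clumps is obvious to the test.
p_without_outlier <- t.test(subtype_1[-6], subtype_2)$p.value

p_with_outlier
p_without_outlier
```

Dropping the outlier by hand is only done here to show the effect. It is not a recommended way to analyze count data.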
Furthermore, both edgeR and DESeq2 have good methods for normalizing sequencing depth between samples (FPKM, i.e. dividing by the total number of reads and the gene length, is not good enough).