Question

How to tell if a gene is over- or under-expressed from the output of F test (differential gene expression)?

0

Entering edit mode

8.0 years ago

StephanieK ▴ 110

Hi all,

I have a set of normalised, log2 transformed gene expression data for two age groups. I performed a linear regression on each gene, and used an F test to determine whether the slope of the curve between the two age groups is different than zero, which in this case would indicate that there is an association between the expression signal and age.

This is a sample of my input:

Gene    Age2    Age2    Age2    Age4    Age4    Age4    Age6    Age6    Age6
0610005C13Rik 0.50  0.88  0.14  0.48  0.096 0.66 0.14 0.13 0.48
0610006L08Rik 0.085 0.055 0.024 0.02  0.44  0.10 0.02 0.06 0.14
0610007P14Rik 0.84  0.94  1.07  0.79  0.96  0.99 0.95 0.80 0.86
0610009B22Rik 1.1   0.99  1.29  1.31  0.96  1.23 1.27 0.83 1.0
0610009L18Rik 0.83  0.91  0.99  0.62  1.09  0.62 1.49 0.78 1.18

This is a sample of my output:

(Intercept) age pval
0610005C13Rik   -8.75673954644126e-17    1                  6.22604270595892e-113
0610006L08Rik   0.160549044129198       -0.13300883967093   0.470334747352544
0610007P14Rik   0.929818139418018       -0.0345393343981819 0.794039611790303
0610009B22Rik   1.10956301739983         0.0143433761977127 0.952572869214549

My problem is that I don't understand the output, specifically, which genes are over-and under-expressed. Could someone explain (1) Why the first gene has a significant p value, but a "age" variable of 1 (this happens in all of my files, all of the differential output files; the first genes has a age variable = 1) and (2) How to tell from this example which genes are over- and under-expressed (although I can see from the P values that it won't be significant)?

Thanks

f test gene R linear regression • 4.3k views

ADD COMMENT • link updated 8.0 years ago by Devon Ryan 104k • written 8.0 years ago by StephanieK ▴ 110

0

Entering edit mode

Could you post what method you used to arrive to these results? (actual code would help). You mention two age groups but I see 3 in your example data (Age2, Age4 and Age6).

ADD REPLY • link 8.0 years ago by ddiez ★ 2.0k

0

Entering edit mode

Thanks, sorry I have multiple data sets, both with two and more than two age groups. I have switched to limma and posted the input file, the code and the output below, if you had any thoughts I'd appreciate it.

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

score 1 · Answer 1 · 2016-11-15

1

Entering edit mode

8.0 years ago

Devon Ryan 104k

If the first value in the file for age is always 1 then you likely have a problem with how you're processing the data or (more likely) writing the output. Regardless, your method is underpowered, and I would strongly advise you to use the limma package instead. Not only will you (A) get results that you know are formatted correctly and (B) are better powered, but also (C) the reviewers won't rip apart your methodology.

ADD COMMENT • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks. I've switched to limma. I have a gene expression matrix of normalised gene expression values. There is >26,000 genes, start of the file is like this:

Gene    Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
Gene1   5.20    5.05    5.01    4.96    5.31    4.83
Gene2   4.85    4.59    4.91    5.06    4.75    4.86
Gene3   7.68    7.30    7.58    7.64    7.45    7.69

I have a phenotype File (age 2.5 months and 20 months, and I want to find genes that are DE with age):

Sample  Age
Sample1 Age2.5
Sample2 Age2.5
Sample3 Age2.5
Sample4 Age20
Sample5 Age20
Sample6 Age20

This case I am comparing two age groups, other data sets I am comparing multiple age groups.

I wrote this code to run the eBayes model in limma:

library(limma)
library(Biobase)
expr <-as.matrix(read.table("Table.tabbed",header=TRUE,sep="\t",row.names=1,as.is=TRUE))
expr
minimalSet <-ExpressionSet(assayData=expr)
pData <-read.table("pData",row.names=1,header=TRUE,sep="\t")
metadata <-data.frame(labelDescription=c("Age"),row.names=c("Age"))
phenoData <-new("AnnotatedDataFrame",data=pData,varMetadata=metadata)
exampleSet <-ExpressionSet(assayData=expr,phenoData=phenoData,annotation="Mouse430_2")

#limma
f <-factor(as.character(exampleSet$Age))
design <-model.matrix(~f)
fit <-eBayes(lmFit(exampleSet,design))
fit$t
fit$p.value

The output:

An object of class "MArrayLM"
$coefficients
              (Intercept)   fAge5
Gene1      122.74   -7.14
Gene2       17.76    7.98
Gene3     2749.66 -140.96
26676 more rows ...

$rank
[1] 2

$assign
[1] 0 1

$qr
$qr
   (Intercept)      fAge5
1   -3.16 -1.58
2    0.31 -1.58
3    0.31 0.24
4    0.31  0.24
5    0.31  0.24
6    0.31 -0.39

attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"


$qraux
[1] 1.31 1.24

$pivot
[1] 1 2

$tol
[1] 1e-07

$rank
[1] 2


$df.residual
[1] 8 8 8 8 8
26676 more elements ...

$sigma
Gene1 Gene2 Gene3 Gene4 Gene5 
     76.52      18.88  315.83     225.08      60.69
26676 more elements ...

$cov.coefficients
            (Intercept) fAge5
(Intercept)         0.2  -0.2
fAge5              -0.2   0.4

$stdev.unscaled
              (Intercept)     fAge5
Gene1   0.44 0.63
Gene2   0.44 0.63
Gene3   0.44 0.63
26676 more rows ...

$pivot
[1] 1 2

$Amean

Gene1 Gene2 Gene3 Gene4 Gene5 119.17 21.75 2679.18 3109.49 544.06 26676 more elements ...

$method
[1] "ls"

$design
   (Intercept) fAge5
1            1     1
2            1     1
3            1     1
4            1     1
5            1     1
6            1     0

attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"


$df.prior
[1] 1.109467

$s2.prior
[1] 1365.136

$var.prior
[1] 0.011 0.011

$proportion
[1] 0.01

$s2.post
Gene1 Gene2 Gene3 Gene4 Gene5
    5308.61      479.43    87768.19    44660.61     3401.09
26676 more elements ...

$t
              (Intercept)      fAge5
Gene1    3.76 -0.15
Gene2    1.81  0.57
Gene3   20.75 -0.75
Gene4   27.48 7.66
Gene5   24.27 -4.82
26676 more rows ...

$df.total
[1] 9.10 9.10 9.10 9.10 9.10
26676 more elements ...

$p.value
               (Intercept)        fAge5
Gene1 4.34e-03 8.8e-01
Gene2 1.02e-01 5.7e-01
Gene3 5.5e-09 4.7e-01
Gene4 4.4e-10 2.9e-05
Gene5 1.36e-09 9.08e-04
26676 more rows ...

$lods
              (Intercept)     fAge5
Gene1   -4.45 -4.60
Gene2   -4.54 -4.60
Gene3   -4.34 -4.60
Gene4   -4.33 -4.48
Gene5   -4.34 -4.50
26676 more rows ...

$F
[1]   13.38    5.09 409.20 1111.87  446.79
26676 more elements ...

$F.p.value
[1] 1.94e-03 3.26e-02 1.20e-09 1.30e-11 8.10e-10
26676 more elements ...

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

1

Entering edit mode

Make sure that the phenoData is really matching what your design should be. It looks a bit like it's putting 5 samples in one group and one sample in another, when I think you have 2 groups of 3 members each.

BTW, you might find the topTable function a bit more convenient than parsing through fit.

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks, I had actually made a small mistake, this is the output (in the previous, I just replaced the gene names with fake names to save space). Can I ask what variable you were looking at that let you know that? The output is now:

An object of class "MArrayLM"
$coefficients
              (Intercept)     fAge2.5
0610005C13Rik    5.037427  0.05310667
0610006L08Rik    4.895797 -0.10814000
0610007P14Rik    7.596513 -0.07157667
0610009B22Rik   11.256733  0.10033333
0610009L18Rik   10.056347 -0.06747333
26676 more rows ...

$rank
[1] 2

$assign
[1] 0 1

$qr
$qr
  (Intercept)    fAge2.5
1  -2.4494897 -1.2247449
2   0.4082483 -1.2247449
3   0.4082483  0.2898979
4   0.4082483 -0.5265986
5   0.4082483 -0.5265986
6   0.4082483 -0.5265986
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"


$qraux
[1] 1.408248 1.289898

$pivot
[1] 1 2

$tol
[1] 1e-07

$rank
[1] 2


$df.residual
[1] 4 4 4 4 4
26676 more elements ...

$sigma
0610005C13Rik 0610006L08Rik 0610007P14Rik 0610009B22Rik 0610009L18Rik 
   0.19048153    0.16261838    0.17039991    0.05204972    0.15951263 
26676 more elements ...

$cov.coefficients
            (Intercept)    fAge2.5
(Intercept)   0.3333333 -0.3333333
fAge2.5      -0.3333333  0.6666667

$stdev.unscaled
              (Intercept)   fAge2.5
0610005C13Rik   0.5773503 0.8164966
0610006L08Rik   0.5773503 0.8164966
0610007P14Rik   0.5773503 0.8164966
0610009B22Rik   0.5773503 0.8164966
0610009L18Rik   0.5773503 0.8164966
26676 more rows ...

$pivot
[1] 1 2

$Amean
0610005C13Rik 0610006L08Rik 0610007P14Rik 0610009B22Rik 0610009L18Rik 
     5.063980      4.841727      7.560725     11.306900     10.022610 
26676 more elements ...

$method
[1] "ls"

$design
  (Intercept) fAge2.5
1           1       1
2           1       1
3           1       1
4           1       0
5           1       0
6           1       0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"


$df.prior
[1] 3.984665

$s2.prior
[1] 0.01396661

$var.prior
[1] 1145.58956   38.60707

$proportion
[1] 0.01

$s2.post
0610005C13Rik 0610006L08Rik 0610007P14Rik 0610009B22Rik 0610009L18Rik 
   0.02514634    0.02021766    0.02151584    0.00832708    0.01971646 
26676 more elements ...

$t
              (Intercept)    fAge2.5
0610005C13Rik    55.02144  0.4101637
0610006L08Rik    59.63739 -0.9314651
0610007P14Rik    89.70069 -0.5976377
0610009B22Rik   213.66168  1.3466183
0610009L18Rik   124.04702 -0.5885228
26676 more rows ...

$df.total
[1] 7.984665 7.984665 7.984665 7.984665 7.984665
26676 more elements ...

$p.value
               (Intercept)   fAge2.5
0610005C13Rik 1.373041e-11 0.6924705
0610006L08Rik 7.226661e-12 0.3789160
0610007P14Rik 2.788625e-13 0.5666409
0610009B22Rik 2.735258e-16 0.2150759
0610009L18Rik 2.098760e-14 0.5724488
26676 more rows ...

$lods
              (Intercept)   fAge2.5
0610005C13Rik    17.55056 -6.541068
0610006L08Rik    18.19526 -6.178009
0610007P14Rik    21.24966 -6.439954
0610009B22Rik    25.79853 -5.731014
0610009L18Rik    23.31567 -6.445677
26676 more rows ...

$F
[1]  3059.443  3478.926  7970.757 46060.024 15284.766
26676 more elements ...

$F.p.value
[1] 3.035238e-12 1.818368e-12 6.658023e-14 6.061865e-17 4.953270e-15
26676 more elements ...

Could I ask about my second comment above just to check I'm interpreting the output properly? Will look into TopTable for sure. Thanks a mil for the help.

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

2

Entering edit mode

That looks much better :)

I was looking at design. Note that you're now comparing Age2.5 vs. Age25 in such a way that a positive coefficient corresponds to a decrease in expression with age. Perhaps that's what you want, but my guess is that you want to change the base level of the Age factor in your pheno data (this is the single most confusing part about factors in R!) to be Age2.5 rather than Age25.

Which comment is your second one (sorry, I have a bunch going on in parallel, so I sometimes lose track of what I've responded to and what I haven't)?

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you so much for your help.

I know how to relevel factors, I've done it in the Cox Model before, but I'm actually having trouble with it here. My total code is:

library(limma)
library(Biobase)

expr <-as.matrix(read.table("Table.tabbed",header=TRUE,sep="\t",row.names=1,as.is=TRUE))
minimalSet <-ExpressionSet(assayData=expr)
pData <-read.table("pData",row.names=1,header=TRUE,sep="\t")
metadata <-data.frame(labelDescription=c("Age"),row.names=c("Age"))
phenoData <-new("AnnotatedDataFrame",data=pData,varMetadata=metadata)
#phenoData
exampleSet <-ExpressionSet(assayData=expr,phenoData=phenoData,annotation="Mouse430_2")


#limma
f <-factor(as.character(exampleSet$Age))
design <-model.matrix(~f)
fit <-eBayes(lmFit(exampleSet,design))
topTable(fit,p.value=0.05)

I know that in the phenoData line, I just need to say somehow "relevel(Age,"Age2.5") / relevel("Age","fAge2.5") / etc", but no matter where I put it in the code I get an error, if you could advise I would appreciate it.

I've implemented TopTable(), and I can see the output for the above code is (not accounting for relevelling):

Removing intercept from test coefficients
                  logFC   AveExpr         t      P.Value    adj.P.Val        B
1500015O10Rik  2.07  6.05  26.89 4.04e-09 0.0001 9.47
Cpne2         -2.84  9.77 -24.17 9.38e-09 0.0001 9.08
Mettl21e       1.79  7.89  23.16 1.31e-08 0.0001 8.91
Rhpn2         -1.65  6.33 -20.20 3.85e-08 0.0002 8.32
Itpr1         -1.87 10.34 -19.69 4.70e-08 0.0002 8.202
Kera           3.62  7.74  17.57 1.14e-07 0.00051 7.63

So can I confirm, looking at the first gene, what this is saying is that for gene 1500015010Rik, the logFC is the fold change, and adj P value is p value, so this says that this gene is 2-fold (significantly) over expressed in Age2.5 compared to Age25 (because I haven't worked out how to relevel yet).

Thanks again for your help, I am genuinely so appreciative, you were so clear.

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

0

Entering edit mode

I found out how to relevel, finally! I was putting it in the wrong place, for everyone else: add the line

f <-relevel(f,"Age2.5")

under the line:

f <-factor(as.character(exampleSet$Age))

So then if I do that, my output is:

Removing intercept from test coefficients
                  logFC   AveExpr         t      P.Value    adj.P.Val        B
1500015O10Rik -2.076969  6.054590 -26.89394 4.044968e-09 0.0001079238 9.476454
Cpne2          2.847223  9.777422  24.17557 9.383927e-09 0.0001166984 9.086132
Mettl21e      -1.796220  7.890103 -23.16905 1.312152e-08 0.0001166984 8.917099
Rhpn2          1.659970  6.332522  20.20268 3.853648e-08 0.0002508156 8.320838
Itpr1          1.873408 10.346146  19.69755 4.700265e-08 0.0002508156 8.202075
Kera          -3.629143  7.749602 -17.57019 1.149173e-07 0.0005110182 7.633771

So then can I confirm that this means, for example for 15000015010Rik, the adjusted p value is <0.05 and the logFC is -2.07, so this means that this gene is 2.07-fold under-expressed with age.

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

0

Entering edit mode

2^2.07 lower, so 4x lower. Otherwise you got it exactly :)

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

This looks interesting. Does this code, and limma, work the same way for multiple phenotype groups instead of a pairwise? What is there was an extra 4 samples of a third bracket e.g. Age50?

ADD REPLY • link 8.0 years ago by simonbyrneucl • 0

0

Entering edit mode

Sure, you can do whatever sort of comparisons you want. Imbalanced group sizes aren't a problem.

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

So can I ask: 1. How could you tell my previous method was underpowered (i.e. from the intercept or age variable?), because I'll need to work out some how how my current model fits the data/is the test is well powered enough, so if you knew where I should be looking to find this info.

2.Do you agree that I should use the t test results (i.e. $t and $p.value in above output) for when there is two age groups, and $F and $F.p.value for when there is more than two age groups.

So then can I check Gene4 and Gene5 are the two significantly differentially expression genes (2.9e-05 and 9.08e-04 from the $p.value variable) and then I use the $t variable to tell me whether the gene is over or under expressed based on the plus/minus, so in this case gene4 is 7.6-fold over expressed and gene5 is 4.82 times under expressed in the Age5 month category compared to the Age20 month category?

Then conversely, since I want to look at what happens with age, I want my younger age group to be the base to which the older group is compared, so I can say that Gene4 is 7.6-fold under expressed and gene5 is 4.82 times over expressed in the Age20 month category compared to the Age5?

Thanks so much for the help.

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

2

Entering edit mode

I could tell from the methodology. Have a read through the original paper on limma, which goes into detail early on about the benefit of pooling information across samples.
I can't encourage parsing through fit, please use topTable to get a better summary, which will include the adjusted p-values. There is never any reason to try to directly find T or F statistics, all that matters are the fit coefficients, the adjusted p-values, and the coefficient SEs (one can, of course, compute T and F statistics from that). I should note that it's important to see what exactly the F values in fit are: "numeric vector of moderated F-statistics for testing all contrasts defined by the columns of ‘fit’ simultaneously equal to zero." Whether this actually answers the biological question you want to answer or not will be case dependent. The limma user guide has some example of multifactorial and time course experiments that you might want to flip through.

Regarding if gene4 is DE, if we ignore that these are unadjusted p-values then yes it's DE. For the direction of change, you'd look in the coefficients member. Again, though, this is much easier to get from topTable(), where you'll find everything already together.

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks so much, so I've edited the code slightly to include comments etc for other people to understand, I just have one question:

This is the code (for everyone else):

library(limma)
library(Biobase)

#Make an expression data set
expr <-as.matrix(read.table("Table.tabbed",header=TRUE,sep="\t",row.names=1,as.is=TRUE))
minimalSet <-ExpressionSet(assayData=expr)
pData <-read.table("pData",row.names=1,header=TRUE,sep="\t")
metadata <-data.frame(labelDescription=c("Age"),row.names=c("Age"))
phenoData <-new("AnnotatedDataFrame",data=pData,varMetadata=metadata)
exampleSet <-ExpressionSet(assayData=expr,phenoData=phenoData,annotation="Mouse430_2")

#Check normalised data by plotting a box plot
boxplot(as.data.frame(exprs(exampleSet)))


#Do limma
f <-factor(as.character(exampleSet$Age))
f <-relevel(f,"Age2.5")
design <-model.matrix(~f)
fit = eBayes(lmFit(exampleSet,design))
topTable(fit,p.value=0.05)

#Make a volcano plot: 
volcanoplot(fit,highlight=10,coef=2)

I'm finding it difficult to understand the "design <-model.matrix(~f)". I understand there are two ways to make the model. I know from reading that one involves not having an intercept and fitting contrasts "manually". Since I am not picking -1 or 0 in the model.matrix, I am keeping the intercept. Since I have an intercept, I will not need to fit a contrast since it is implicit in the model. I don't fully understand all of this, but I think that's the difference anyway. I've seen more papers that explicitly do the contrasts method than do my method (without explicitly doing the contrasts). Why would this be?

Is there a guideline for how to make the model (i.e. specifically write the "design" line/include contrasts) e.g. say I have two age groups, or more than two age groups, how I should write the "design" line, or is there a test you do that will tell you what way of making the model based on your data?

I also tried to include the contrasts manually (for this code I currently get the error lmFit -> lm.series -> lm.fit):

#Do limma
f <-factor(as.character(exampleSet$Age))
f <-relevel(f,"Age2.5")
sample=c("Age2.5","Age20")
design =model.matrix(~0+factor(sample))
colnames(design) = c("Age2.5","Age20")
fit=lmFit(exampleSet, design=design)

contast.matrix = makeContrasts(Age2.5-Age20,levels=design)
fit2=contrasts.fit(fit, contrast.matrix)
eb = eBayes(fit2)
topTable(eb,p.value=0.05)

#Make a volcano plot
volcanoplot(fit,highlight=10)

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

1

Entering edit mode

I guess one benefit of always using a contrast is that you can more easily set whether you want Age2.5 - Age20 or Age20 - Age2.5, without mucking with factor levels.

Whether to use an intercept or not depends mostly on how you think about the experiment and which parameterization more easily allows answering the biological question. As an example, suppose you have the common four group experiment: "WT + vehicle", "mutant + vehicle", "WT + drug", and "mutant + drug". If you're interested in the interaction between genotype and treatment, then using a factorial design (~genotype+treatment+genotype:treatment) is the simplest method (you could do this with the contrast method, but this would be more confusing). If, instead you care about differences due to treatment independently within each genotype, then you're life is simplest by not specifying an intercept and instead directly using contrasts.

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you so much. I got advice from someone who knows more than me, and she said: "It's best for the linear model to set the intercept. If you set an intercept you are saying you expect the line to go through the origin (which isn't normally the case for microarray data). Generally the intercept is of little interest in microarray data. If you look at the design matrix if should be a binary matrix 0,1. You can make this manually if it's easier for you. So it would be Age1. Age2. Age3 Sample 1 1 0 0 Sample 2. 0 1 1 "

So could I clarify (with someone else who knows more than me!) that she's saying to do it the way I originally did it, i.e., not setting contrasts, and using this code:

f <-factor(as.character(exampleSet$Age))
f <-relevel(f,"Age2.5")
design <-model.matrix(~f)
fit = eBayes(lmFit(exampleSet,design))
topTable(fit,p.value=0.05)

Which is allowing the linear model to set the intercept.

When i ran just this part of the code:

f <-factor(as.character(exampleSet$Age))
f <-relevel(f,"Age2.5")
design <-model.matrix(~f)
design

And had a look at the output:

  (Intercept) fAge20
1           1      0
2           1      0
3           1      0
4           1      1
5           1      1
6           1      1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"

I think this is what she is saying? I have a binary matrix that's dividing the samples into their age brackets (I'm not sure what to do about the fact that in my output, I also have an intercept column with all 1's that she didn't have in her example, we were specifically talking about limma).

Thanks so much.

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

1

Entering edit mode

She's certainly agreeing that it's good here to use ~f as the design, but I can interpret her understanding of how intercepts are created in a couple different ways. Practically speaking, the results will end up being exactly the same in your case regardless of whether you use an intercept or not. If you use ~0 + f, then what was the intercept is now relabeled as fAge2.5 but nonetheless has exactly the same value for each of the genes (in this case, the fAge20 coefficient is added to whatever the (Intercept) coefficient for each gene became) . This is why I said at the end of the day it doesn't really matter whether you use something like ~f or ~0+f, provided it makes answering the questions you have simple.

As an aside, if you really want to learn exactly how these model matrices are used, then you might want to take a class on linear algebra. In hindsight, it turns out to have been one of the more useful classes I took as a biologist (if only I had paid better attention during it!). You might also find a nice video tutorial on this on youtube, since this is the sort of thing that's easier to understand when worked out on a white board with an example than in flat text (at least I find texts on this subject more confusing than helpful, but maybe that's just how I think).

Anyway, you're on the right track to high quality results :)

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

i am curious of your opinion that says "there is never any reason to try and find t or f statistics, all that matter are the fit coefficients and p values". It is interesting because I can see some places such as this seem to use the linear regression coefficients and obtain p values from these, whereas other places like this say to use an F test; do you have an opinion on when to use which one? Maybe this is a general algebra question and not gene expression differential expression related.

ADD REPLY • link 8.0 years ago by Tom ▴ 50

2

Entering edit mode

Generally an "F test" in this context denotes comparing some model versus ~1, which is often not exactly biologically informative (unless you're doing a time-course experiment and want to ask, "does time even matter?"). In most cases, however, we want to know both what effect has a change with some significance threshold (p-value), but also the magnitude of that effect (fit coefficient). Those combined with the coefficient variance (the coefficient and its variance naturally dictate the p-value) are actionable values, since you can use them to determine whether it's worth while expending more resources on a line of experimentation or R&D.

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you. So since there are so many different ways to do this analysis, I checked with someone exactly how I should do it, would this make sense to you? I have some data sets: the phenotypic data is different age categories (sometimes two age categories, sometimes >2; the aim is to find DE genes between the different ages). I was told, regardless of whether there are two age groups, or more than two age groups, do a two-tailed F test to check if the slope of the curve is different from zero, which would indicate an association between expression signal and age.

So I combined your advice and was doing some reading; for two age groups (e.g. here and here for anyone else), I think this is how I do it:

If I have a sample group like this with two age groups (i.e. Table.pData):

Sample1 Young
Sample2 Young
Sample3 Young
Sample4 Young
Sample5 Aged
Sample6 Aged
Sample7 Aged

I run this:

library(limma)
library(Biobase)
expr <-as.matrix(read.table("Table",header=TRUE,sep="\t",row.names=1,as.is=TRUE))
minimalSet <-ExpressionSet(assayData=expr)
pData <-read.table("Table.pData",row.names=1,header=TRUE,sep="\t")
metadata <-data.frame(labelDescription=c('Age'),row.names=c('Age'))
phenoData <-new("AnnotatedDataFrame",data=pData,varMetadata=metadata)
exampleSet <-ExpressionSet(assayData=expr,phenoData=phenoData,annotation="HG-U133_Plus_2")
f <-factor(as.character(exampleSet$Age))
f <-relevel(f,"Young")
design <-model.matrix(~f)
fit = eBayes(lmFit(exampleSet,design))
topTable(fit)

I get this output (Sample output, numbers not real):

           logFC   AveExpr         t      P.Value   adj.P.Val        B
Gene1  2.4 10  7.8 3.0e-07 0.007 5.3
Gene2  4.4  20  7.5 5.1e-07 0.007 5.0
Gene3   5.0  30  6.8 2.1e-06 0.01 4.0
Gene4   1.9  40  6.7 2.2e-06 0.01 3.9

This means: Compared to my base age, i.e. "Young", these four genes have these Log2 fold changes (are all over expressed in the old category compared to young). Is the P value, the p value from the estimated t statistic? I've read here that the T stat and the F stat are the same for two groups of pooled samples. Then the adjusted p value is the adjusted p val of the T test(?) and the B value is the log odds that the gene is differentially expressed, which most people seem to ignore.

Then, if I have more than 2 age groups:

e.g. Table.pData

Sample1 Age3.0
Sample2 Age3.0
Sample3 Age3.0
Sample4 Age6.0
Sample5 Age6.0
Sample6 Age6.0
Sample7 Age9.0
Sample8 Age9.0
Sample9 Age9.0

and I run the code:

library(limma)
library(Biobase)
expr <-as.matrix(read.table("Table",header=TRUE,sep="\t",row.names=1,as.is=TRUE))
minimalSet <-ExpressionSet(assayData=expr)
pData <-read.table("Table.pData",row.names=1,header=TRUE,sep="\t")
metadata <-data.frame(labelDescription=c('Age'),row.names=c('Age'))
phenoData <-new("AnnotatedDataFrame",data=pData,varMetadata=metadata)
exampleSet <-ExpressionSet(assayData=expr,phenoData=phenoData,annotation="RAE230A")
f <-factor(as.character(exampleSet$Age))
f<-relevel(f,"Age3.0")
design <-model.matrix(~f)

fit <-lmFit(exampleSet,design)
fit <-eBayes(fit)
topTable(fit)

The output is like this (again these numbers are false):

          fAge12.0  fAge23.0   fAge6.0   fAge9.0   AveExpr        F PVal AdjustedPVal
Npc2     3.2 1.1 2.9 3.9 2.4 5.7 0.01 0.04
Gpnmb     4.3  1.3  3.1 2.3  2.5 4.2 0.02 0.05
Tmem176a 3.2  3.9  7.5  9.3  8.8 3.6 0.03  0.05

This is telling me, compared to my base age (Age3.0), the log2 fold change of each age group compared to the base age group (all genes over expressed in all age cats compared to base). And then again, is the P value, the P value of the F stat, which is checking whether the slope of the different age groups is different from 0 from the F test?

Thanks again for your time.

ADD REPLY • link 8.0 years ago by StephanieK ▴ 110

1

Entering edit mode

tldr: Yes, the F and AdjustedPVal are what you want in the topTable output.

Ah, if you have a time course experiment then yes, comparing to ~1 (aka, doing an F test) can be somewhat informative. In that case, you'll need to get fit$F and fit$F.p.value (due adjust this for multiple comparisons with p.adjust()!). For the two group case you can skip this (as you found F is T^2).

When you have multiple groups and don't tell topTable() which results you want, then the F statistic and p-values are for whatever model you gave it versus ~1 (i.e., the intercept only model, which is basically a "1-way ANOVA", if that's helpful). Biologically, this is asking "forget individual effects between any group(s), is any of this grouping worthwhile or is there just no change due to anything?"