Entering edit mode
9.9 years ago
Mo
▴
920
I have tired to find Up and down regulated based on Limma in R.
What I have done is as follows:
library(limma)
fit <- lmFit(data, design)
# Moderated t-statistic
fit2 <- eBayes(fit)
t<- topTable(fit2)
results <- decideTests(fit2)
summary(results)
vennDiagram(results)
plotMA(fit2, 2)
volcanoplot(fit2)
Now I have the following output
x1 x2 AveExpr F P.Value adj.P.Val
205476_at 1.0335848 2.777778e-06 0.9319210 31.17153 6.807618e-14 1.516533e-09
201110_s_at 1.9916715 -3.703704e-06 1.7957690 29.82112 2.446732e-13 2.725293e-09
212158_at 1.0642415 -3.703704e-06 0.9595617 28.11326 1.239096e-12 9.201115e-09
205239_at 1.5099863 9.259259e-07 1.3614631 26.50244 5.747837e-12 3.201114e-08
212154_at 0.8753344 -4.629630e-06 0.7892355 25.99093 9.364945e-12 4.172458e-08
214211_at 0.9923502 -2.136625e-17 0.8947420 24.90243 2.650229e-11 8.014586e-08
213988_s_at 0.7521288 -9.259259e-07 0.6781488 24.83824 2.818056e-11 8.014586e-08
200748_s_at 0.8140374 -3.703704e-06 0.7339678 24.76978 3.008813e-11 8.014586e-08
201109_s_at 1.4810858 2.777778e-06 1.3354055 24.69309 3.237926e-11 8.014586e-08
202444_s_at -0.7800713 -2.059028e-01 -0.7235957 24.49669 3.907505e-11 8.704748e-08
How can I now find which genes are up and which ones are down regulated?
if I perform without my design, I will get something like below
fit <- lmFit(data)
and the rest as above
t
logFC AveExpr t P.Value adj.P.Val B
205476_at 0.9319210 0.9319210 7.479942 1.518325e-13 3.382372e-09 19.96770
201110_s_at 1.7957690 1.7957690 7.317002 4.884719e-13 5.440844e-09 18.82518
212158_at 0.9595617 0.9595617 7.105471 2.153154e-12 1.598861e-08 17.37602
205239_at 1.3614631 1.3614631 6.899900 8.775398e-12 4.887239e-08 16.00487
212154_at 0.7892355 0.7892355 6.833298 1.372724e-11 6.075399e-08 15.56854
202444_s_at -0.7235957 -0.7235957 -6.806994 1.636324e-11 6.075399e-08 15.39729
214211_at 0.8947420 0.8947420 6.689331 3.564056e-11 9.541769e-08 14.63867
213988_s_at 0.6781488 0.6781488 6.680742 3.770674e-11 9.541769e-08 14.58377
200748_s_at 0.7339678 0.7339678 6.671567 4.004358e-11 9.541769e-08 14.52520
201109_s_at 1.3354055 1.3354055 6.661281 4.283238e-11 9.541769e-08 14.45962
I have 1000 controlled and 100 untreated samples , while I have over 30000 of probes.
I made my design like this
design <- t(xor(t(replicate(2, c(rep(TRUE, 1000), rep(FALSE, 100)))), c(0, 1))) * 1L
topTable()
with the options you showed produced that output.coef=1
.Yes, I updated the question
That's still highly unlikely to be the
topTable()
command that produced that output.@Devon Ryan I just repeated the procedure and this is what I got, I am sure of it. Why is it surprising?
By the way when I try to make
I get some error like
topTable()
doesn't normally produce a column of F values, but of T values. Also, it labels logFC and whatever the column to the right of that is, which is usually not present.Regarding the error with
makeContrasts()
, it'd be rather helpful to know whatdesign
looks like...Can you post a snapshot of your data? How many samples in each condition? Even more importantly, how did you build your design matrix? That will help us helping you tweaking the code
@TriS, @Devon Ryan, Please have a look at above, I updated my question.
Do you mean 1000 treated/tumor/disease/whatever and 100 controls?
When you call
topTable
you should useThe
coef=2
will tell you which comparisons you are making. in case of your design matrix, if you have 100 controls and 1k tumor/disease/treated/etc.. then you will seelogFC > 0
andlogFC < 0
for those higher and lower in the control, respectively.If you want the positive fold changes to reflect higher gene expression in your treated samples, you have to invert the 1s and 0s in the second column of your design matrix, but it all depends which ones are the controls.
Also, to extract the genes that are up/down regulated you can do the following:
This will create a matrix results containing genes with a log2 FC > +/-1.5 and adjusted p.values (FDR) < 0.05
@TriS, You mean I must not use the design to be able to get the logFC values?
If I use coef=2 , I get such error
You must use the design matrix, otherwise limma doesn't know which ones are your controls and which ones are your treated. looks like the size of your design matrix is off = is not the correct size.
The number of rows in the design matrix should be equal to the number of columns in the data matrix.
Can you run
dim(your_data_matrix)
to see how big it is?@TriS Thanks for your message! Yes I only want for now to find those up and down regulated genes . However, I get error using your code and I have been searching to solve it but I could not. The error is as follows:
You want
Inf
rather thaninf
.@Devon Ryan By setting the Inf, it works. However, my results is empty where could be the problem?
I just looked at adj.P Val and I found most of them are higher than 0.05
Results are empty because with the line above you keep only values with adj pvalue < 0.05, which are none.
You can use
This will give you those with p.value (NOT FDR) smaller than 0.05
Check which one is the correct column name, if it is
t$P.Value
ort$P.Val
ort$p.value
, you want to get the spelling right ;)You can get the up/down-regulated genes by
That will subset genes based on a log2 fold changes of 1, you can change that value as you wish. negative values = down-regulated, positive values = up-regulated
I tried to find a cut-off for logFC, however, I want to know your opinion.
Therefore, I think 0.5 would be a good choice to distinguish up from down regulated, no? if so, I tried to find them as you said
then
here I removed the
abs
to only select the positive onesDoes this make any sense ? if so, I am trying to plot its heatmap as follows:
Make a volcano plot of your topTable of logFC and p-value and see if you have some genes which are differentially expressed
@Manvendra Singh based on volcano plot, you hardly can say something, for example, look at my plot
http://tinypic.com/view.php?pic=2e3683b&s=8#.VOTdHEIk_dk
if you know any other way, I will be happy to know
Try this
The significant one will have different color and size
@TriS thanks! Do you know how to put the ID(probes names) on those significant ones?
you highlighted those which have P.value smaller than 0.05 and LogFC higher than 0.5 with red color and those which are much higher than 1.5 with a bigger size and of course red. it is handy and beautiful
http://tinypic.com/view.php?pic=2gtyvs8&s=8#.VOUT-EIk_dk
If you just want to label the big and red ones then you can do it as follow:
the
geom_text()
function allows to do thatRemember, we are happy to help but Google is also a good friend :)
You can go this way too
Oh, that design is not going to tell you anything useful. You're just going to get two intercepts. If the first 1000 samples are your experimental group and the last 100 the control group, then:
mm
is the model matrix appropriate forlmFit()
.Now I did it like this, do you think it is OK now? how then I recognise my up and down regulated ?
Those would be vastly more useful results. Before, you were asking, "what has an average signal reliably above 0?", which I rather doubt you actually cared about. The fold-changes here are relative to the 1000 control samples. Granted, there's nothing significant, but that's a different matter.
@Devon Ryan thanks for your valuable comment ! I really appreciate your help and @TriS
So does it mean , my genes are not statistically significant expressed? Thus, I cannot say which one is up and which one is down regulated?
No, many of them are significantly expressed, rather the model your using doesn't suggest that there are any differences between the two groups. Of course, whether the model is appropriate to the actual experimental design is a different question. I wouldn't be surprised if you have some other experimental factor that needs to be controlled.
If you just wanna see the genes up and down check my code below. for the "significance" you would like to have FDR (or adjusted p.value) < 0.05. This means that you accept that 5% of your genes could be false positive. You could use 0.1 or 0.2 but I don't take hard conclusions out of it and be aware that 10% or 20% or your data might be false positive. Your p.values are relative to the single comparison. each time you do multiple tests (in this case multiple moderated t.tests, 22277 times), you add the chance of creating errors, that's why you correct or multiple testing. However, if you check
sum(t$p.value < 0.05)
I think you will find very few hits.Also, I'm not sure about the error you got up there...
@TriS is it not strange that I get ZERO?
Given the results you showed 2 hours ago, no, that's expected. I suspect that
sum(t$p.value<0.9)
is also 0.@Devon Ryan YES, that is also ZERO, why? Where could be the problem?
Something is wrong here! Look at the plot I just draw
http://tinypic.com/view.php?pic=11gsot2&s=8#.VORL6ELsefQ
http://tinypic.com/view.php?pic=n4f575&s=8#.VORLskLsefQ
http://tinypic.com/view.php?pic=mj4ldc&s=8#.VORMBkLsefQ
I rather obviously can't see any plot.
I don't know why I cannot share the images! I will try to work on it a little bit more and then inform you with new findings!
Biostars doesn't host images or other files for you. Upload them somewhere and link to them.
@Devon Ryan I have uploaded them on a website and you can see them above!
It is funny when I use the design like
I get figures more logical!
With your strange xor-based design, you're calculating two intercepts. You're strongly advised to chat with a local statistician, who can more easily go over the details of your experiment and show you visually what your xor-based design is actually doing (not what you want). Some of the images you posted do look strange, so I suspect there are other issues with how the experiment is being analysed. Given that you seem to be more or less completely new at this, it'll be vastly more efficient to have someone local help you further.
@Devon Ryan After reading Limma Tutorial and other posts on Booster, I definitely understand I must use
model.matrix
and I really appreciate your comments. There should be other problem with the data which I must figure out. Therefore, I repeat all steps I have done for the data and see whether the results change or not!There isn't necessarily a problem. Perhaps there are simply no differentially expressed probes. Alternatively, perhaps the normalization wasn't performed correctly or the design wasn't sufficient to describe your data. The simplest way to find out would be to contact someone locally knowledgeable in bioinformatics. Playing with the data is vastly more efficient than getting help over the 'net.