Hello All
I have 2 mammalian cell lines: A and B. A is subjected to 3 treatment conditions X,Y,Z. B is subjected to treatment conditions X and Y. So my RNA-Seq data has counts from AX, AY, AZ, BX and BY. I want to perform anova analysis to screen for differentially regulated genes.
I am trying to analyze my data using edgeR by following the users guide. Since I am completely new to this, it is not very intuitive to me. I want to make a model matrix to perform the anova-like comparison test to screen for genes that are different across all 5 conditions.
My code is below:
#count data
AX1 AX2 AY1 AY2 AZ1 AZ2 BX1 BX2 BY1 BY2
rna22510 0 0 0 0 0 0 0 0 0 0
rna22511 0 0 0 0 0 0 0 0 0 0
rna22512 0 0 0 0 0 0 0 0 0 0
rna22513 0 0 0 0 0 0 0 0 0 0
rna22514 0 0 0 0 0 0 0 0 0 0
rna22515 0 0 0 0 0 0 0 0 0 0
###I normalize the data and calculate different dispersions. But I feel my main issue is trying to define the matrix to suit my needs.
> design<-model.matrix(~0+group, data=cds$samples)
> design
#cds is a DGEList object.
#output
groupH_05 groupH_10 groupP_00 groupP_05 groupP_10
P_10 1 0 0 0 0 1
P_10 2 0 0 0 0 1
P_05 1 0 0 0 1 0
P_05 2 0 0 0 1 0
P_00 1 0 0 1 0 0
P_00 2 0 0 1 0 0
H_10 1 0 1 0 0 0
H_10 2 0 1 0 0 0
H_05 1 1 0 0 0 0
H_05 2 1 0 0 0 0
> cds<-estimateGLMCommonDisp(cds,design)
> cds<-estimateGLMTrendedDisp(cds,design)
Loading required package: splines
> cds<-estimateGLMTagwiseDisp(cds,design)
> fit2<-glmFit(cds,design)
> lrt.anova_like<-glmLRT(fit2,coef=2:5)
> lrt.anova_like
An object of class "DGELRT"
$coefficients
groupH_05 groupH_10 groupP_00 groupP_05 groupP_10
gene1 -12.722907 -12.231696 -11.460983 -11.881307 -11.863651
rna1 -10.723691 -11.950632 -11.671753 -11.126594 -11.035141
rna2 -9.496628 -9.295025 -9.801454 -9.722280 -9.247496
rna3 -8.249089 -8.176574 -8.175494 -8.220004 -8.093095
rna4 -10.855702 -10.826411 -10.816899 -11.020586 -10.697726
12745 more rows ...
> top<-topTags(lrt.anova_like)
> top
Coefficient: groupH_10 groupP_00 groupP_05 groupP_10
logFC.groupH_10 logFC.groupP_00 logFC.groupP_05 logFC.groupP_10
rna111 -26.24556 -19.80748 -21.22361 -21.59235
rna366 -26.24556 -16.69825 -17.77939 -17.69264
rna498 -26.24556 -18.59508 -19.06105 -19.82423
rna735 -26.24556 -17.55049 -26.24556 -26.24556
rna1587 -26.24556 -17.97298 -26.24556 -26.24556
rna2000 -26.24556 -14.89324 -26.24556 -26.24556
gene2467 -26.24556 -23.90012 -19.09330 -26.24556
rna2338 -26.24556 -20.38091 -21.21060 -20.19601
rna2962 -26.24556 -21.82490 -21.59319 -22.53018
rna3147 -26.24556 -19.25034 -18.93527 -21.38125
logCPM LR PValue FDR
rna111 -0.76528954 9870.499 0 0
rna366 2.06883064 36241.154 0 0
rna498 0.34196368 16723.335 0 0
rna735 0.31486174 3362.217 0 0
rna1587 -0.06429664 3358.159 0 0
rna2000 2.74046265 2598.921 0 0
gene2467 -0.69390328 13430.350 0 0
rna2338 -0.79058544 2893.519 0 0
rna2962 0.06280098 3613.531 0 0
rna3147 0.05144112 10288.125 0 0
In the above output, the P-value of the toptags is zero and even the FDR is zero which makes me think that my output is wrong.
I would appreciate any help that you could provide me.
Thanks
Your paramaterization is mostly useful for looking at contrasts, which you don't seem to be doing. The p-value will be ~0 for anything that's expressed, which is unlikely to be the question you want to answer.
Thank you for your reply Devon. The reason I was worried about p-value is because I have noticed really small p-values but not a zero. Why do you say that I am not doing contrasts, I thought
glmLRT(fit2, coef=2:5)
was supposed to do that.Could you also suggest a method/changes that I can make to my code to obtain anova like analysis. Basically I want to look for genes that are differentially expressed without specifying the groups beforehand (like mentioned in the users guide). And it looks like it based on making contrasts too.
I am very new to programming and bioinformatics, so any help would be appreciated.
I missed the
glmLRT(fit2, coef=2:5)
line, but that's probably not telling you anything of biological interest. Why don't you just say the biological question that you wish to address?Edit: Actually, I think you need an intercept column for
glmLRT(fit2, coef=2:5)
to make sense. Otherwise, you probably do end up testing expression.Edit2: In fact, this is mentioned in the edgeR UsersGuide, section 3.2.6. Regarding a 1-way ANOVA-like test, "... this is done by specifying multiple coefficients to glmLRT, when the design matrix includes an intercept term."
Sure. So basically I have two chinese hamster ovary cells lines. I am trying to study the transcriptome changes during serum-free adaptation. Typically 10% serum is added to media to mammalian cell cultures and they have to be adapted to serum-free media. So I identified three different serum concentrations where I want to study RNA of the cells - 10%, 5% and 0%. Only one cell line (A) was able to adapt completely, hence the other cell line(B) has only two different serum concentrations (10%, 5%).
What I was thinking was to first do an ANOVA analysis to see genes that are differentially regulated since doing a pairwaise comparison initially would generate a lot of false positives. After the anova analysis I would like to do pairwise comparison between meaningful conditions.