Hello,
I am trying to include gene names alongside the Ensembl gene IDs in the final edgeR output. The problem is that I get different results depending on whether or not I add the gene symbols. Here is the example code without gene symbols:
library(edgeR)
exp2 <-read.table('exp2',header=TRUE)
y <- readDGE(exp2,labels=exp2$sample,group=exp2$treatment)
keep <- rowSums(cpm(y)>1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
bcv=0.4
et_D0_D1 <- exactTest(y, pair=c("D0","D1"),dispersion=bcv^2)
write.table(topTags(et_D0_D1,n=Inf),file="test1.txt",col.names=TRUE,quote=FALSE)
Result is: ENSGACG00000003380 1.00049178363978 4.49563701685595 0.234971858102264 1
Here is the one with gene symbols added using biomaRt:
library(edgeR)
library(biomaRt)
exp2 <-read.table('exp2',header=TRUE)
y <- readDGE(exp2,labels=exp2$sample,group=exp2$treatment)
geneid <-rownames(y)
ga82 <- useEnsembl(biomart="ensembl",dataset="gaculeatus_gene_ensembl",version=82)
genes <- getBM(attributes=c('ensembl_gene_id','external_gene_name'),filters='ensembl_gene_id',values=geneid,mart=ga82)
y$genes <- genes
keep <- rowSums(cpm(y)>1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
bcv=0.4
et_D0_D1 <- exactTest(y, pair=c("D0","D1"),dispersion=bcv^2)
write.table(topTags(et_D0_D1,n=Inf),file="test2.txt",col.names=TRUE,quote=FALSE)
Here the result is: ENSGACG00000003380 NA -0.135739707808875 2.95905945166089 0.874200189025877 1
sessionInfo() output is given below:
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.24.1 edgeR_3.10.4 limma_3.24.15
loaded via a namespace (and not attached):
[1] IRanges_2.2.9 DBI_0.3.1 parallel_3.2.1
[4] RCurl_1.95-4.7 Biobase_2.28.0 AnnotationDbi_1.30.1
[7] RSQLite_1.0.0 S4Vectors_0.6.6 BiocGenerics_0.14.0
[10] GenomeInfoDb_1.4.3 stats4_3.2.1 bitops_1.0-6
[13] XML_3.98-1.3
I am a bit puzzled. Why would adding gene names to y$genes above change the results? Any help you could provide would be much appreciated. Many thanks in advance.
How many rows do you have before and after adding the gene names?
Thanks for your reply. I have 22460 rows both before and after adding the gene names.
Please also post the top 5-10 results for the before and after cases.
Before adding gene names:
After adding gene names:
Your gene annotation is somehow not correct. Notice that the numbers (FC, P-value, etc.) are the same before and after, but the ensembl_gene_id attached to them is different!
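To make the point concrete, here is a minimal base-R sketch of how such a misalignment could be detected. The IDs and gene names are made-up toy values: `count_ids` stands in for `rownames(y)` and `genes` stands in for a `getBM()` result, which is not guaranteed to come back in the order of the IDs you queried.

```r
# Toy stand-ins (hypothetical values, not real biomaRt output):
# `count_ids` plays the role of rownames(y),
# `genes` plays the role of the data frame returned by getBM().
count_ids <- c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D")
genes <- data.frame(
  ensembl_gene_id    = c("ENSG_C", "ENSG_A", "ENSG_D", "ENSG_B"),
  external_gene_name = c("gene_c", "gene_a", "gene_d", "gene_b"),
  stringsAsFactors   = FALSE
)

# The row counts agree, so counting rows alone does not reveal the problem.
same_nrow <- nrow(genes) == length(count_ids)

# The real question: does row i of the annotation describe row i of the counts?
aligned <- identical(genes$ensembl_gene_id, count_ids)

# Positions where the annotation ID disagrees with the count-matrix ID.
mismatched <- which(genes$ensembl_gene_id != count_ids)
```

With the real objects, the equivalent check would be comparing `rownames(y)` against `y$genes$ensembl_gene_id` right after the annotation is attached.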
Aha! There is a merging issue. Any ideas on how I can fix this?
?merge()
PS: This is just to keep away the min-char bot.
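For illustration, here is one way the alignment could be done. Note that this sketch uses `match()` rather than `merge()`, because `merge()` can reorder rows and add extra rows for duplicated IDs, so its output would still need reordering. The data are toy values (hypothetical IDs and names) in which one ID is missing from the annotation and one is duplicated.

```r
# Toy stand-ins (hypothetical values):
# `count_ids` plays the role of rownames(y),
# `genes` plays the role of the getBM() result.
count_ids <- c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D")
genes <- data.frame(
  ensembl_gene_id    = c("ENSG_C", "ENSG_A", "ENSG_A", "ENSG_D"),
  external_gene_name = c("gene_c", "gene_a", "gene_a_alt", "gene_d"),
  stringsAsFactors   = FALSE
)

# For each count-matrix ID, find the first annotation row with that ID.
# IDs missing from the annotation give NA instead of shifting later rows.
idx <- match(count_ids, genes$ensembl_gene_id)

# Build an annotation table with exactly one row per count-matrix row,
# in the same order as the counts.
annot <- data.frame(
  ensembl_gene_id    = count_ids,
  external_gene_name = genes$external_gene_name[idx],
  stringsAsFactors   = FALSE
)

# In the script from the question this would then be attached with:
# y$genes <- annot   # assumes `y` is the DGEList and count_ids <- rownames(y)
```

Keeping only the first match for a duplicated ID is a simplification made for this sketch. If several symbols per gene matter, they would need to be collapsed into one string per ID before matching.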
Thanks Santosh. Very helpful
Please close this thread by accepting @MMa's answer (green check mark to the left) if it solved your problem. Otherwise, you can write your own answer and accept it.