Those problematic Ensembl IDs are ones that have mappings to multiple Entrez (Refseq) IDs. You also make your situation difficult by converting your list to an array, then a data-frame, and then a matrix. You'll notice that these multiple mappings, based on the way that you've processed the data and just before your matrix conversion step, are separated by a comma in each entry. Using your code:
xx <- as.list(org.Mm.egENSEMBL2EG)
xx_table <- as.array(xx)
xx_table <- as.data.frame(xx_table)
tail(xx_table, 51)
xx_table
ENSMUSG00000090744 102642386
ENSMUSG00000095717 102642706, 102642868
ENSMUSG00000072915 102642717
Thus, the subsequent conversion of this to a data-matrix (which only allows numerical values) trips up because it sees the comma and doesn't know what to do with it. The as.matrix function does not give a warning either, but it should.
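If you want to see which IDs are affected before converting anything, here is a minimal sketch. It assumes the object behaves like a plain named list of character vectors; the toy `xx` below is a stand-in that I typed by hand from the rows shown above, not the real `as.list(org.Mm.egENSEMBL2EG)` output, and `n_map` and `multi` are names I made up for illustration.

```r
# Toy stand-in for as.list(org.Mm.egENSEMBL2EG): a named list in which
# one Ensembl ID maps to two Entrez IDs (values copied from the rows above).
xx <- list(
  ENSMUSG00000090744 = "102642386",
  ENSMUSG00000095717 = c("102642706", "102642868"),
  ENSMUSG00000072915 = "102642717"
)

# Number of Entrez IDs attached to each Ensembl ID
n_map <- lengths(xx)

# Ensembl IDs that have more than one Entrez mapping
multi <- names(n_map)[n_map > 1]
print(multi)
```

On the real list, `table(lengths(xx))` should give a quick overview of how many IDs have 1, 2, 3, ... mappings.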
A more direct way to convert a list to a data-frame is with do.call, as you'll see in my code below:
xx <- as.list(org.Mm.egENSEMBL2EG)
xx_table <- do.call(rbind, lapply(xx, data.frame, stringsAsFactors=FALSE))
xx_table <- data.frame(rownames(xx_table), xx_table)
You'll then begin to see the issue:
tail(xx_table,55)
rownames.xx_table. X..i..
ENSMUSG00000094556 ENSMUSG00000094556 102641780
ENSMUSG00000095508 ENSMUSG00000095508 102641863
ENSMUSG00000103587 ENSMUSG00000103587 102642162
ENSMUSG00000090744 ENSMUSG00000090744 102642386
ENSMUSG00000095717.1 ENSMUSG00000095717.1 102642706
ENSMUSG00000095717.2 ENSMUSG00000095717.2 102642868
ENSMUSG00000072915 ENSMUSG00000072915 102642717
Here, ENSMUSG00000095717 maps to 2 Entrez IDs, and do.call (coupled with data.frame) has suffixed the IDs with .1 and .2 to make the row names unique. We can tidy these up with gsub and then finish the remainder of the code:
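# Strip the ".1", ".2" suffixes that were added to keep the row names unique
xx_table[,1] <- gsub("\\.[0-9]+$", "", xx_table[,1])
# write.csv stores the row names as the first column, so after read.csv
# the Ensembl IDs are in column 2 and the Entrez IDs in column 3
write.csv(xx_table, "~/ens2ncbi.csv")
ens2ncbi <- read.csv(file="~/ens2ncbi.csv")
# Keep Entrez first, then Ensembl
ens2ncbi <- ens2ncbi[, 3:2]
colnames(ens2ncbi) <- c("Entrez", "Ensembl")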
head(ens2ncbi)
Entrez Ensembl
1 11287 ENSMUSG00000030359
2 11298 ENSMUSG00000020804
3 11302 ENSMUSG00000025375
4 11303 ENSMUSG00000015243
5 11304 ENSMUSG00000028125
6 11305 ENSMUSG00000026944
tail(ens2ncbi,52)
Entrez Ensembl
24186 102642386 ENSMUSG00000090744
24187 102642706 ENSMUSG00000095717
24188 102642868 ENSMUSG00000095717
24189 102642717 ENSMUSG00000072915
24190 102902673 ENSMUSG00000096370
24191 103164605 ENSMUSG00000102424
24192 104795665 ENSMUSG00000092765
24193 104795666 ENSMUSG00000093246
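As an aside, the same two-column table can probably be built without the row-name suffixes and the gsub clean-up at all. This is only a sketch under the assumption that the list is a plain named list of character vectors; the toy `xx` is again a hand-typed stand-in for the real annotation list.

```r
# Toy stand-in for as.list(org.Mm.egENSEMBL2EG) (values copied from the output above)
xx <- list(
  ENSMUSG00000090744 = "102642386",
  ENSMUSG00000095717 = c("102642706", "102642868"),
  ENSMUSG00000072915 = "102642717"
)

# One row per (Ensembl, Entrez) pair: repeat each Ensembl ID once for
# every Entrez ID that it maps to, and flatten the Entrez IDs in the same order.
ens2ncbi <- data.frame(
  Entrez  = unlist(xx, use.names = FALSE),
  Ensembl = rep(names(xx), lengths(xx)),
  stringsAsFactors = FALSE
)
print(ens2ncbi)
```

Because the Ensembl IDs never pass through row names here, nothing gets renamed and the multi-mapped ID simply appears on two rows.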
You'll just have to be wary of this going forward. Many of the merge functions will only take the first match that they find, which may be just fine.
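To make the "first match" behaviour concrete, here is a small sketch using base R's match(), which returns only the first hit, next to %in%, which keeps every hit. The toy `ens2ncbi` is typed by hand from the rows above, and `query`, `first_hit` and `all_hits` are illustrative names only. Which behaviour you actually get depends on the function you use for the lookup, so it is worth checking.

```r
# Toy version of the mapping table (rows copied from the output above)
ens2ncbi <- data.frame(
  Entrez  = c("102642386", "102642706", "102642868", "102642717"),
  Ensembl = c("ENSMUSG00000090744", "ENSMUSG00000095717",
              "ENSMUSG00000095717", "ENSMUSG00000072915"),
  stringsAsFactors = FALSE
)

query <- "ENSMUSG00000095717"

# match() gives the position of the FIRST matching row only
first_hit <- ens2ncbi$Entrez[match(query, ens2ncbi$Ensembl)]

# %in% flags EVERY matching row
all_hits <- ens2ncbi$Entrez[ens2ncbi$Ensembl %in% query]

print(first_hit)
print(all_hits)
```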
Kevin
Thank you very much Kevin, and sorry for always being a pain. As a musician (or should I use "as.musician"??) I always find it difficult to deal with this kind of problem, and for me bioinformatics is an ongoing, stepwise process of learning by doing and trial and error. I am working on the code and I will let you know if any issues still occur.
Hi Mozart, so, a new symphony is being released soon?! will it be available on Bioconductor?
No problem. The learning process even for me and the most senior Professors is never ending. Should one assume that they already know everything, then they just highlight how little they truly know.
I encounter bugs on an almost daily basis. It is very difficult to account for all eventualities. Systems like air traffic control obviously do have to account for all eventualities, but they go through different levels of testing than our standard bioinformatics tools.
I see; the thing is, at some point it's quite frustrating to be blocked by what for you expert guys is just a simple issue. Anyway, I come from a totally different background and it's like going back to University, spending hours and hours on "silly things". For example, I spent the whole day trying to run ReactomePA and I am still not able to solve the problem. Generally speaking, my strategy is to reproduce the tutorial in order to understand where I "fall". This time it was pretty easy, as they use in the example
returning a table like that (please notice the grey column with the NCBI name and the white column with the p value)
and I tried my best but I was not able to make something better than that (where the grey column just gives the order of both values; I am not even sure whether the grey column is a column or just the row name)
but probably I will sort out tomorrow!
If you have a separate question, it would be a good idea to open a new thread.
It also sounds like you need a guide for the lonely bioinformatician, written by my professional colleague Mick Watson (he's Scottish; I'm Irish - same thing).
That's just an automatically-assigned row number. You can ignore it.
In R:
You could most likely set your rownames to Entrez IDs with:
Thanks so much for your reply, and sorry that I didn't open another thread for this other problem; I will keep it in mind for next time (or am I supposed to create a new post for this?). I tried to do so but something unexpected happened, because I got this kind of error
So I tried 'make.names', creating a matrix from scratch and putting the Entrez IDs as row.names, but I think it's way too messy for me. Any suggestions, guys?
I guess that I was just giving an example of how to set rownames. You do not actually have to set rownames in this case. You already have the data-frame with the Ensembl-to-Entrez mappings, with Entrez in the first column and Ensembl in the second.
The error is produced here because, evidently, we also have the situation where more than one Ensembl ID maps to the same Entrez ID. Working across annotations, these issues always occur.
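For anyone who does need Entrez IDs as rownames, here is a minimal sketch of why the assignment fails and one possible way around it. The IDs below are made up purely for illustration (they are not real Ensembl or Entrez IDs), and keeping only the first Ensembl ID per Entrez ID is an assumption that may or may not suit your analysis.

```r
# Made-up toy table: two Ensembl IDs (ENS_B, ENS_C) share one Entrez ID
ens2ncbi <- data.frame(
  Entrez  = c("1001", "1002", "1002"),
  Ensembl = c("ENS_A", "ENS_B", "ENS_C"),
  stringsAsFactors = FALSE
)

# Row names must be unique, so this assignment raises an error;
# tryCatch() is used here only to capture the outcome.
res <- tryCatch({
  rownames(ens2ncbi) <- ens2ncbi$Entrez
  "ok"
}, error = function(e) "error")
print(res)

# One option: keep only the first row for each Entrez ID,
# after which the Entrez IDs are unique and can be used as row names.
unique_map <- ens2ncbi[!duplicated(ens2ncbi$Entrez), ]
rownames(unique_map) <- unique_map$Entrez
print(unique_map)
```

Running `sum(duplicated(ens2ncbi$Entrez))` on the real table should tell you how many rows are affected before you decide what to drop.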
Thanks very much Kevin, it seems to work smoothly now; I can definitely come up with a new composition now. Thanks for giving me the right inspiration!