Hiya,
Is there a consensus on which of the packages mentioned in the title is most suitable for converting between ENSEMBL gene IDs and their respective gene symbols? From what I have gathered in previous discussions, biomaRt queries the most up-to-date database of the mappings, so it should also have the most up-to-date names. I have tested org.Hs.eg.db, EnsDb.Hsapiens.v86 and biomaRt on a full set of ~60k genes as an example: org.Hs.eg.db failed to map ~27k, biomaRt ~20k, and EnsDb.Hsapiens.v86 only ~5k.
From this, EnsDb.Hsapiens.v86 seems superior in terms of the number of IDs mapped; many of the genes it additionally filled in have names beginning with RP-.* (as do many lincRNAs/lncRNAs). Then again, it is based on a much older ENSEMBL version (v86) with possibly outdated gene names, and looking into some of the conflicting entries between EnsDb and biomaRt shows that the latter does have the more up-to-date names for those genes.
Would using a mixture of the DBs be a good idea (i.e. base most of the mapping on EnsDb, then check whether any IDs that failed to map are in org.Hs.eg.db, and finally use biomaRt for anything still missing, as well as to overwrite any conflicting names)? Or is there one that people generally prefer?
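To make the mixture idea concrete, here is a minimal, language-agnostic sketch in Python. It assumes each lookup has already been run and exported as a plain ID-to-symbol dictionary; the function name `layered_symbols` and the IDs and symbols in the toy data are made up purely for illustration and are not real mappings. In R, the same logic would sit on top of whatever each package returns.

```python
def layered_symbols(gene_ids, ensdb, orgdb, biomart):
    """Combine three ID -> symbol lookups with a fixed precedence.

    Precedence (lowest to highest): org.Hs.eg.db, EnsDb, biomaRt.
    A non-empty biomaRt symbol therefore fills gaps AND overwrites
    conflicting EnsDb names. IDs found nowhere map to None.
    """
    result = {}
    for gid in gene_ids:
        symbol = None
        # Later sources in this tuple override earlier ones.
        for source in (orgdb, ensdb, biomart):
            candidate = source.get(gid)
            if candidate:  # skip None and empty strings (treated as unmapped)
                symbol = candidate
        result[gid] = symbol
    return result


# Hypothetical toy data; these IDs and symbols are placeholders.
ensdb = {"ID1": "OLD_NAME", "ID2": "RP11-EXAMPLE"}
orgdb = {"ID1": "OTHER_NAME", "ID3": "GENE3"}
biomart = {"ID1": "NEW_NAME", "ID2": "", "ID4": "GENE4"}

mapped = layered_symbols(["ID1", "ID2", "ID3", "ID4", "ID5"], ensdb, orgdb, biomart)
unmapped = [gid for gid, sym in mapped.items() if sym is None]
```

Here `ID1` takes the biomaRt name, `ID2` keeps the EnsDb name because biomaRt has only an empty entry for it, `ID3` falls back to org.Hs.eg.db, and `ID5` stays unmapped.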
Thanks in advance!
Irrespective of which tool you use, all of them should give more or less consistent results when run with similar filters (if any) and the same database version. For biomaRt, the output is the same as that of the BioMart data-mining tool from Ensembl. There, if you filter for protein-coding genes, the number is indeed around 19-20K for the human genome.