Question

Ensembl IDs <-> Gene Symbols mapping -- Using org.Hs.eg.db, biomaRt or EnsDb.Hsapiens.v86?

12 months ago

DGTool ▴ 290

Hiya,

Is there a consensus on which of the packages in the title is most suitable for converting between Ensembl gene IDs and their gene symbols? From what I have gathered in previous discussions, since biomaRt queries the most up-to-date database of the mappings, it should also have the most up-to-date names. I tested org.Hs.eg.db, EnsDb.Hsapiens.v86 and biomaRt on a full set of ~60k genes as an example: org.Hs.eg.db failed to map ~27k, biomaRt ~20k, and EnsDb.Hsapiens.v86 only ~5k.

From this, EnsDb.Hsapiens.v86 seems superior in terms of the number of IDs mapped; many of the genes it filled in have names beginning with RP-.* (and many are lincRNAs/lncRNAs). Then again, it is based on a much older Ensembl release (v86) with possibly outdated gene names, and looking at some of the conflicting entries between EnsDb and biomaRt shows that the latter does have the more up-to-date names.

Would using a mixture of the databases be a good idea (i.e. base most on EnsDb, then check whether any that failed to map are in org.Hs.eg.db, and finally use biomaRt for anything still missing, as well as to overwrite any conflicting names)? Or is there one that people prefer?

Thanks in advance!

org.hs.eg.db EnsDb.Hsapiens.v86 biomaRt ensembl • 2.6k views

ADD COMMENT • link updated 12 months ago by tomas4482 ▴ 430 • written 12 months ago by DGTool ▴ 290

Irrespective of which tool you use, all should give more or less consistent results given similar filters (if any) and the same database version. For biomaRt, the output is the same as that of the BioMart data-mining tool from Ensembl. There, if you filter for protein-coding genes, the number is indeed around 19-20k for the human genome.

ADD REPLY • link 12 months ago by manaswwm ▴ 570

score 1 · Answer 1 · 2024-06-05

12 months ago

ATpoint 88k

I actually find all of these annotation packages tedious (or terrible). I always download the GTF file from GENCODE or Ensembl that matches the annotation I use and then simply do a left_join with the gene IDs I want to translate. In R you can download and read a GTF directly via rtracklayer::import, which gives you a GRanges representation. Convert that to a data frame and then use anything downstream, e.g. tidyverse, for the joining.
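A minimal sketch of that GTF workflow, in case it helps. The two-line GTF written to a temp file is a toy stand-in (made-up coordinates and a made-up third ID) so the snippet runs on its own; in practice `gtf_path` would point at the GENCODE or Ensembl GTF of the release you used.

```r
library(rtracklayer)
library(dplyr)

# Toy two-gene GTF (hypothetical coordinates) so the sketch is self-contained.
# Replace gtf_path with the path or URL of your real GTF.
gtf_path <- tempfile(fileext = ".gtf")
writeLines(c(
  paste("chr17", "toy", "gene", 100, 200, ".", "+", ".",
        'gene_id "ENSG00000141510"; gene_name "TP53";', sep = "\t"),
  paste("chr17", "toy", "gene", 300, 400, ".", "-", ".",
        'gene_id "ENSG00000012048"; gene_name "BRCA1";', sep = "\t")
), gtf_path)

# Read the GTF as a GRanges object, then flatten it to a data frame
gtf <- rtracklayer::import(gtf_path)

# Keep one row per gene with just the ID and the symbol
gene_map <- as.data.frame(gtf) %>%
  dplyr::filter(type == "gene") %>%
  dplyr::distinct(gene_id, gene_name)

# IDs to translate; the last one is made up and should come back as NA
my_genes <- data.frame(
  gene_id = c("ENSG00000141510", "ENSG00000012048", "ENSG99999999999")
)

annotated <- dplyr::left_join(my_genes, gene_map, by = "gene_id")
print(annotated)
```

Note that GENCODE GTFs carry the version suffix inside gene_id, whereas Ensembl GTFs keep it in a separate gene_version attribute, so the join key may need the suffix stripped first.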

ADD COMMENT • link 12 months ago by ATpoint 88k

Oh, actually that sounds like a good idea to try out as well, thanks. I've downloaded the GTF file, parsed out the gene_id and gene_name mapping and compared it to the packages above (just to see the differences). The biomaRt annotation agrees fully with the GTF on the gene_id <-> symbol mappings, so that's good to know. I do wonder how EnsDb.Hsapiens.v86 filled in all the others (but since it's based on an old release, maybe there were some bigger discrepancies in the mapping). Given the above, I'll probably stick to the GTF/biomaRt.
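For anyone wanting to repeat that kind of comparison, here is a rough base-R sketch. The two data frames and all IDs and symbols in them are placeholders, not real mappings; `symbol_gtf` and `symbol_pkg` are hypothetical column names for the GTF-derived and package-derived symbols.

```r
# Placeholder mappings; in practice these come from the GTF and from a package
gtf_map <- data.frame(
  gene_id    = c("ENSG_A", "ENSG_B", "ENSG_C"),
  symbol_gtf = c("GENE1", "GENE2", NA)
)
pkg_map <- data.frame(
  gene_id    = c("ENSG_A", "ENSG_B", "ENSG_D"),
  symbol_pkg = c("GENE1", "GENE2_OLD", "GENE4")
)

# Full outer join so IDs present in only one source are kept
cmp <- merge(gtf_map, pkg_map, by = "gene_id", all = TRUE)

# Classify each ID: symbol missing in at least one source, agreeing, or conflicting
cmp$status <- with(cmp, ifelse(
  is.na(symbol_gtf) | is.na(symbol_pkg), "missing_in_one",
  ifelse(symbol_gtf == symbol_pkg, "agree", "conflict")
))

print(table(cmp$status))
```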

ADD REPLY • link 12 months ago by DGTool ▴ 290

Gene annotations get updated over time, which can cause discrepancies. v86 is very old, many years by now. What matters is that you use the same reference that was used for the initial processing of your data. Say you aligned RNA-seq against Ensembl 101, then be sure to use the release 101 GTF for this conversion.

ADD REPLY • link 12 months ago by ATpoint 88k

score 1 · Answer 2 · 2024-06-05

12 months ago

tomas4482 ▴ 430

EnsDb.Hsapiens.v86 is an ensembldb (EnsDb) object built from Ensembl Homo sapiens release 86.

With biomaRt, the default arguments give you the latest release, which is currently 112.

Using different annotation sources and releases causes disagreement in genomic coordinates and feature annotations. That is why some of your genes fail to map; it is not that EnsDb.Hsapiens.v86 is superior. Additionally, remember to remove the version suffix from the gene ID before querying for HGNC symbol, Ensembl ID, transcript ID, peptide ID, etc.
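If you would rather not depend on the default, a specific archive release can be requested instead. This is a sketch only: release 101 and the two gene IDs are arbitrary examples, the call needs network access, and the requested release has to still be available in the Ensembl archives.

```r
library(biomaRt)

# Assumption: 101 stands in for whichever release your data were processed with
mart <- useEnsembl(biomart = "ensembl",
                   dataset = "hsapiens_gene_ensembl",
                   version = 101)

# Map Ensembl gene IDs (without version suffix) to HGNC symbols
map <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
             filters    = "ensembl_gene_id",
             values     = c("ENSG00000141510", "ENSG00000012048"),
             mart       = mart)
print(map)
```

`listEnsemblArchives()` shows which releases can currently be requested this way.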
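Stripping the version suffix can be done with a one-line regular expression. A small sketch; `strip_version` is a hypothetical helper name and the version numbers in the example IDs are only illustrative.

```r
# Remove a trailing ".<digits>" version suffix from Ensembl-style IDs
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

ids <- c("ENSG00000141510.18", "ENSG00000012048", "ENST00000269305.9")
clean_ids <- strip_version(ids)
print(clean_ids)
```

IDs without a suffix pass through unchanged, so it is safe to apply to mixed input.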

If you want to use an ensembldb object for other purposes, you can make your own from GTF/GFF files, from AnnotationHub, or directly via the Ensembl Perl API. To build your own consistent annotation library, please read the ensembldb manual, which describes in detail how to retrieve an annotation database from AnnotationHub. AnnotationHub hosts several annotation databases; currently the latest version for human is release 111, which corresponds to Ensembl release 111 in biomaRt.

ADD COMMENT • link 12 months ago by tomas4482 ▴ 430

Yeah, I guess it would be more ideal to go with the most recent release. I've checked the mappings from the GTF file, and they are concordant with the mappings from biomaRt. Good call on the version ID removal; there weren't any on my list, but it definitely could have been a possibility. I guess it was just surprising that between v86 (EnsDb.Hsapiens.v86) and the most recent release (v112) there would be a difference of ~15k gene <-> symbol mappings that were removed (changed?). The ensembldb package does look interesting for some other things that might be useful, thanks for mentioning it.

ADD REPLY • link 12 months ago by DGTool ▴ 290

The most recent release is not necessarily the best. Choose the version most appropriate for your data. For example, the current release of the TCGA database uses GENCODE 36 for annotation, while the latest GENCODE release is 46. If you use the latest GENCODE annotation with such data, you will fall into the same pit.

ADD REPLY • link 12 months ago by tomas4482 ▴ 430