R org.Hs.eg.db: matching Ensembl gene IDs with gene symbols
2
5
8.7 years ago
User6891 ▴ 330

Hi,

I want to add a column to a data frame in R that holds the gene symbol corresponding to each Ensembl gene ID:

resOrdered$symbol <- mapIds(org.Hs.eg.db,
                     keys=row.names(resOrdered),
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")

I'm using org.Hs.eg.db from Bioconductor for this.

I get the following error:

Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.

I think this is because my row.names from my dataframe resOrdered look like this:

[9997] "ENSG00000100601.5"  "ENSG00000178826.6"  "ENSG00000243663.1"  "ENSG00000138231.8"

I think the problem is the period and version number that follow the actual ENSG ID. Is there a way to still find a match with the ENSEMBL key from org.Hs.eg.db?

bioconductor R • 45k views
ADD COMMENT
4

Otherwise, you can always remove the string after the period.

tmp=gsub("\\..*","",row.names(resOrdered))
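
Putting the two pieces together, a minimal sketch of the whole fix might look like the following. It assumes resOrdered is a data frame whose row names are versioned Ensembl gene IDs (the small data frame and its log2FoldChange values are made up for illustration) and that org.Hs.eg.db and AnnotationDbi are installed; the mapIds() step is simply skipped if they are not.

```r
# Toy stand-in for the real results table (hypothetical values).
resOrdered <- data.frame(
  log2FoldChange = c(1.2, -0.8, 0.3, 2.1),
  row.names = c("ENSG00000100601.5", "ENSG00000178826.6",
                "ENSG00000243663.1", "ENSG00000138231.8")
)

# Drop the period and everything after it to get unversioned IDs.
ensembl_ids <- sub("\\..*$", "", rownames(resOrdered))

# Look up symbols only if the annotation packages are available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  resOrdered$symbol <- AnnotationDbi::mapIds(
    org.Hs.eg.db::org.Hs.eg.db,
    keys = ensembl_ids,
    column = "SYMBOL",
    keytype = "ENSEMBL",
    multiVals = "first"
  )
}
```

The row names themselves are left untouched here, so the versioned IDs stay available for later steps.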
ADD REPLY
0

Hello Sukhdeep,

I have exactly the same question as User6891, and when I try to remove the decimal I get an error.

Error: unexpected input in "tmp=gsub("\\..*","",row.names(res)�"

Could you please help me with this?

ADD REPLY
1

The command should work. I see an unidentified symbol in the command you pasted.

Try typing it out by hand and see if it works!

ADD REPLY
0
tmp=gsub("\\..*","",row.names(res)​)

This is my command, and it shows a question mark in the error.

Error: unexpected input in "tmp=gsub("\\..*","",row.names(res)�"
ADD REPLY
0

As I said, the above command should work unless you have a copy-paste error or the object res has some issue. Check row.names(res): what does it output?

ADD REPLY
0

It's working, thanks a lot :) and thanks for your patience.

But one more question: how do I put the edited Ensembl IDs from tmp back into my res column?

I know it is a very basic question but I am new to R.

ADD REPLY
0

Thanks a lot, Sukhdeep... it all worked fine :)

ADD REPLY
0

Great, good luck then!

ADD REPLY
0

How did you eventually add tmp back to the res row.names? The answer is not in this thread and I can't figure it out.

Also, is it possible to edit the gene ids in-place instead of creating 'tmp'?

ADD REPLY
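
For anyone landing here with the same question, one way to do this is sketched below. It assumes res is a data frame whose row names are the versioned IDs (a made-up one is used here, with hypothetical baseMean values). The same substitution can be assigned straight back to the row names, which avoids the intermediate tmp. Row names must be unique, so it is worth checking for collisions first.

```r
# Hypothetical results table with versioned Ensembl IDs as row names.
res <- data.frame(
  baseMean = c(10.5, 220.1, 3.4),
  row.names = c("ENSG00000100601.5", "ENSG00000178826.6", "ENSG00000243663.1")
)

# Option 1: go through tmp, then write it back.
tmp <- gsub("\\..*", "", row.names(res))
stopifnot(anyDuplicated(tmp) == 0)  # row names have to stay unique
row.names(res) <- tmp

# Option 2 (equivalent, in place): no intermediate variable.
# rownames(res) <- sub("\\..*$", "", rownames(res))

# Alternatively, store the stripped IDs in an ordinary column.
res$ensembl_gene_id <- tmp
```

Keeping the stripped IDs in a column rather than in the row names can be the safer choice when two versioned IDs collapse to the same unversioned ID.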
0

Can you explain what "\\..*","" does?

ADD REPLY
0

It removes the period and everything that follows it, i.e., it deletes (technically, substitutes with an empty string) the rest of the string. See this.
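
A short illustration of how the pattern is read may help. In "\\..*", the doubled backslash is how R writes a literal period, and .* matches any characters after it up to the end of the string; the second argument "" is the replacement. The stricter pattern "\\.[0-9]*$" used elsewhere in this thread only removes a purely numeric suffix. The IDs below are illustrative.

```r
ids <- c("ENSG00000100601.5", "ENSG00000138231.8", "ENSG00000243663")

# "\\."  a literal period (an unescaped "." would match any character)
# ".*"   everything after that period
gsub("\\..*", "", ids)

# Only strip a trailing ".<digits>" version suffix.
gsub("\\.[0-9]*$", "", ids)

# Without the escape, "." matches the first character, so the whole string is removed.
gsub("..*", "", ids)
```

An ID that has no version suffix, like the third one, passes through the first two calls unchanged.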

ADD REPLY
6
6.2 years ago

Could try biomaRt:

library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)

ens <- c("ENSG00000100601.5", "ENSG00000178826.6",
  "ENSG00000243663.1", "ENSG00000138231.8")
ensLookup <- gsub("\\.[0-9]*$", "", ens)
ensLookup
[1] "ENSG00000100601" "ENSG00000178826" "ENSG00000243663" "ENSG00000138231"

annotLookup <- getBM(
  mart=mart,
  attributes=c("ensembl_transcript_id", "ensembl_gene_id",
    "gene_biotype", "external_gene_name"),
  filters="ensembl_gene_id",
  values=ensLookup,
  uniqueRows=TRUE)

annotLookup <- data.frame(
  ens[match(annotLookup$ensembl_gene_id, ensLookup)],
  annotLookup)

colnames(annotLookup) <- c(
  "original_id",
  c("ensembl_transcript_id", "ensembl_gene_id",
    "gene_biotype", "external_gene_name"))

annotLookup

         original_id ensembl_transcript_id ensembl_gene_id         gene_biotype
1  ENSG00000100601.5       ENST00000216489 ENSG00000100601       protein_coding
2  ENSG00000100601.5       ENST00000557057 ENSG00000100601       protein_coding
3  ENSG00000100601.5       ENST00000555100 ENSG00000100601       protein_coding
4  ENSG00000100601.5       ENST00000554097 ENSG00000100601       protein_coding
5  ENSG00000138231.8       ENST00000260803 ENSG00000138231       protein_coding
6  ENSG00000138231.8       ENST00000460271 ENSG00000138231       protein_coding
7  ENSG00000138231.8       ENST00000477557 ENSG00000138231       protein_coding
8  ENSG00000138231.8       ENST00000463982 ENSG00000138231       protein_coding
9  ENSG00000178826.6       ENST00000409102 ENSG00000178826       protein_coding
10 ENSG00000178826.6       ENST00000487419 ENSG00000178826       protein_coding
11 ENSG00000178826.6       ENST00000359333 ENSG00000178826       protein_coding
12 ENSG00000178826.6       ENST00000480421 ENSG00000178826       protein_coding
13 ENSG00000178826.6       ENST00000409244 ENSG00000178826       protein_coding
14 ENSG00000178826.6       ENST00000409541 ENSG00000178826       protein_coding
15 ENSG00000178826.6       ENST00000410004 ENSG00000178826       protein_coding
16 ENSG00000178826.6       ENST00000482420 ENSG00000178826       protein_coding
17 ENSG00000178826.6       ENST00000471161 ENSG00000178826       protein_coding
18 ENSG00000243663.1       ENST00000493072 ENSG00000243663 processed_pseudogene
   external_gene_name
1              ALKBH1
2              ALKBH1
3              ALKBH1
4              ALKBH1
5                DBR1
6                DBR1
7                DBR1
8                DBR1
9             TMEM139
10            TMEM139
11            TMEM139
12            TMEM139
13            TMEM139
14            TMEM139
15            TMEM139
16            TMEM139
17            TMEM139
18           RPS4XP14

...or without ensembl_transcript_id:

annotLookup <- getBM(
  mart=mart,
  attributes=c("ensembl_gene_id", "gene_biotype", "external_gene_name"),
  filters="ensembl_gene_id",
  values=ensLookup,
  uniqueRows=TRUE)

annotLookup <- data.frame(
  ens[match(annotLookup$ensembl_gene_id, ensLookup)],
  annotLookup)

colnames(annotLookup) <- c(
  "original_id",
  c("ensembl_gene_id", "gene_biotype", "external_gene_name"))

annotLookup
        original_id ensembl_gene_id         gene_biotype external_gene_name
1 ENSG00000100601.5 ENSG00000100601       protein_coding             ALKBH1
2 ENSG00000138231.8 ENSG00000138231       protein_coding               DBR1
3 ENSG00000178826.6 ENSG00000178826       protein_coding            TMEM139
4 ENSG00000243663.1 ENSG00000243663 processed_pseudogene           RPS4XP14
ADD COMMENT
0

Hi,

Thanks for your comment. I need your guidance. I have an "original_id" column and also a "gene_name" column (e.g., ENSG00000100601.5 and ALKBH1), and I need their Entrez IDs. Could you please show me how to get Entrez IDs from "original_id" with biomaRt or another package? I would appreciate your help. Best regards

ADD REPLY
2

Take a look at this example, which will obtain Entrez IDs for you:

library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)

ens <- c("ENSG00000100601.5", "ENSG00000178826.6", "ENSG00000243663.1", "ENSG00000138231.8")
ensLookup <- gsub("\\.[0-9]*$", "", ens)


annotLookup <- getBM(
  mart=mart,
  attributes=c("ensembl_transcript_id", "ensembl_gene_id", "gene_biotype", "external_gene_name", "entrezgene_id"),
  filters="ensembl_gene_id",
  values=ensLookup,
  uniqueRows=TRUE)

annotLookup <- data.frame(
  ens[match(annotLookup$ensembl_gene_id, ensLookup)],
  annotLookup)

colnames(annotLookup) <- c(
  "original_id",
  c("ensembl_transcript_id", "ensembl_gene_id", "gene_biotype", "external_gene_name", "EntrezID"))

annotLookup
ADD REPLY
0

Hi,

This is a very useful script, thank you!

I have a question about the result that I got. I ran this script on my 58283 Ensembl gene IDs (e.g., ENSG00000000003); however, I ended up with only 57714 converted hgnc_symbol values. Do you have any idea why this is happening, and whether there is a way to recover the missing gene IDs?

Kind regards

ADD REPLY
0

You can identify the IDs that do not match by comparing your input against the ones that do match. For example, take a look at the which() command.

The different organisations that maintain annotations, e.g., Ensembl, HGNC, Entrez, Riken, UCSC, etc., each have different rules about what to annotate, based on evidence. Between Ensembl and HGNC, the differences likely relate to non-coding transcripts that Ensembl has decided should be annotated, whereas HGNC has decided that there is not yet enough evidence to support the existence of the transcript.
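
As a rough sketch of that comparison, assume the queried IDs are in a vector (called ensLookup here, as in the answer above) and the getBM() result is a data frame with an ensembl_gene_id column. The small annotLookup below is a hand-made stand-in rather than real biomaRt output, and the last input ID is invented so that something fails to match.

```r
# IDs that were sent to the lookup (the last one is made up and will not match).
ensLookup <- c("ENSG00000100601", "ENSG00000178826", "ENSG00000999999")

# Stand-in for the data frame returned by getBM().
annotLookup <- data.frame(
  ensembl_gene_id = c("ENSG00000100601", "ENSG00000178826"),
  external_gene_name = c("ALKBH1", "TMEM139")
)

# Positions, then values, of the input IDs that are absent from the result.
missing_idx <- which(!ensLookup %in% annotLookup$ensembl_gene_id)
missing_ids <- ensLookup[missing_idx]

# setdiff() gives the same set without the index step.
setdiff(ensLookup, annotLookup$ensembl_gene_id)
```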

ADD REPLY
0

Can you confirm that you're getting hgnc_symbol, and not external_gene_name as Kevin suggests? hgnc_symbol will only return HGNC symbols, which means that many non-coding genes, which are named by Rfam, will not return a gene name. If you request external_gene_name, you should get a name for everything, as it returns whatever gene name there is.

If you're using external_gene_name and still not getting names for a bunch of them, please can you send the list to helpdesk@ensembl.org and we'll take a look. We have had some problems in recent releases with HGNC mapping, so this may be the cause of the missing names.

ADD REPLY
0

Hi @Emily_Ensembl,

I tried both and got even fewer results with external_gene_name. I will send you an email with the list.

From what I can tell those IDs that were not found in the database are either:

  • IDs that end with _PAR_Y. I don't know what that means. Do you have any idea? Is it okay to completely remove that part from the ID?
  • IDs that were deprecated and are not in the current ENSEMBL database anymore.

I want to map the deprecated ones to their new updated genes but I'm not sure if it's even possible. Do you know if there is a way to do that?

Many thanks.

ADD REPLY
0

PAR_Y likely relates to the pseudo-autosomal region (PAR) on chromosome Y, i.e., the region on chromosome Y that pairs with the PAR on chromosome X during meiosis in males.

You can decide whether you need them or not; if you drop them, state in your methods that you removed them.
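
If you do decide to set them aside, a minimal sketch is below. The IDs are illustrative examples of the format described above (a versioned ID, optionally followed by _PAR_Y), not taken from the poster's data.

```r
ids <- c("ENSG00000100601.5", "ENSG00000182378.14", "ENSG00000182378.14_PAR_Y")

# Flag the chromosome Y pseudo-autosomal copies by their suffix.
is_par_y <- grepl("_PAR_Y$", ids)

# Keep the rest and strip the version for the lookup.
kept <- ids[!is_par_y]
lookup_ids <- sub("\\..*$", "", kept)

# Stripping the suffix instead of dropping the IDs would create duplicates.
anyDuplicated(sub("\\..*$", "", ids)) > 0
```

The last line shows why simply deleting the _PAR_Y part is risky: the X and Y copies collapse to the same ID, which matters if the IDs are later used as row names.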

ADD REPLY
0
3.7 years ago

Another way: build a master table and use that:

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)

annot <- select(org.Hs.eg.db,
  keys = keys(org.Hs.eg.db),
  columns = c('ENTREZID','SYMBOL','ENSEMBL','ENSEMBLTRANS'),
  keytype = 'ENTREZID')

head(annot)

  ENTREZID SYMBOL         ENSEMBL    ENSEMBLTRANS
1        1   A1BG ENSG00000121410            <NA>
2        2    A2M ENSG00000175899            <NA>
3        3  A2MP1 ENSG00000256069 ENST00000543404
4        3  A2MP1 ENSG00000256069 ENST00000566278
5        3  A2MP1 ENSG00000256069 ENST00000545343
6        3  A2MP1 ENSG00000256069 ENST00000544183
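
To use such a table for the original problem, one option is match() against the ENSEMBL column. The sketch below uses a small hand-typed annot (rows copied from the head() output above) so that it runs without the annotation package; ids is a hypothetical vector of versioned IDs, and its version numbers are invented.

```r
# Stand-in for the table built with select(); rows copied from head(annot).
annot <- data.frame(
  ENTREZID = c("1", "2", "3", "3"),
  SYMBOL = c("A1BG", "A2M", "A2MP1", "A2MP1"),
  ENSEMBL = c("ENSG00000121410", "ENSG00000175899",
              "ENSG00000256069", "ENSG00000256069"),
  ENSEMBLTRANS = c(NA, NA, "ENST00000543404", "ENST00000566278")
)

# Hypothetical input: versioned IDs; the last one is absent from this small table.
ids <- c("ENSG00000256069.3", "ENSG00000121410.11", "ENSG00000100601.5")
stripped <- sub("\\..*$", "", ids)

# match() returns the first matching row, so genes repeated once per
# transcript still give one value per input ID; unmatched IDs give NA.
symbols <- annot$SYMBOL[match(stripped, annot$ENSEMBL)]
entrez <- annot$ENTREZID[match(stripped, annot$ENSEMBL)]
```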
ADD COMMENT
0

I tried this option, but it seems that the annotation in org.Hs.eg.db is not as up to date as in biomaRt.

With org.Hs.eg.db (v. 3.13.0), 40% of the roughly 60k Ensembl gene IDs I used as input were not matched to a corresponding Ensembl transcript ID, which looks odd to me.

With biomaRt (v. 2.48.1), only 5% of my input Ensembl gene IDs were not mapped, apparently because those 5% were outdated IDs.

So I would choose biomaRt over org.Hs.eg.db to annotate Ensembl IDs.

ADD REPLY
