98.21% of input gene IDs are fail to map
6.8 years ago
wayj86 ▴ 40

Hi all,

I am using clusterProfiler to perform KEGG enrichment. I have a list of gene symbols (rat genes). First I need to translate the symbols to Entrez IDs.

I installed the Genome wide annotation for Rat:

source("https://bioconductor.org/biocLite.R")
biocLite("org.Rn.eg.db")

Then I tried to run the command:

x <- c("GPX3",  "GLRX",   "LBP",   "CRYAB", "DEFB1", "HCLS1",   "SOD2",   "HSPA2",
       "ORM1",  "IGFBP1", "PTHLH", "GPC3",  "IGFBP3","TOB1",    "MITF",   "NDRG1",
       "NR1H4", "FGFR3",  "PVR",   "IL6",   "PTPRM", "ERBB2",   "NID2",   "LAMB1",
       "COMP",  "PLS3",   "MCAM",  "SPP1",  "LAMC1", "COL4A2",  "COL4A1", "MYOC",
       "ANXA4", "TFPI2",  "CST6",  "SLPI",  "TIMP2", "CPM",     "GGT1",   "NNMT",
       "MAL",   "EEF1A2", "HGD",   "TCN2",  "CDA",   "PCCA",    "CRYM",   "PDXK",
       "STC1",  "WARS",  "HMOX1", "FXYD2", "RBP4",   "SLC6A12", "KDELR3", "ITM2B")

library(clusterProfiler)  # provides bitr()
eg = bitr(x, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Rn.eg.db")

Then I got the warning message:

'select()' returned 1:1 mapping between keys and columns
Warning message:
In bitr(x4, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = Rat) :
  98.21% of input gene IDs are fail to map...

Could anyone tell me what happened and what I should do?

Many thanks, Stanley

clusterProfiler R • 11k views

wayj86: If you use @Kevin's solution below, be sure to use the rat biomart dataset in place of the human one in his example.


Good catch genomax. I have added a comment to my answer for Rattus norvegicus.

6.8 years ago
Guangchuang Yu ★ 2.6k
> paste0(substring(x, 1, 1), tolower(substring(x, 2))) -> x2
> x2[x2=="Pvr"] = "PVR"
> eg = bitr(x2, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Rn.eg.db")
'select()' returned 1:1 mapping between keys and columns
> eg
    SYMBOL ENTREZID
1     Gpx3    64317
2     Glrx    64045
3      Lbp    29469
4    Cryab    25420
5    Defb1    83687
6    Hcls1   288077
7     Sod2    24787
8    Hspa2    60460
9     Orm1    24614
10  Igfbp1    25685
11   Pthlh    24695
12    Gpc3    25236
13  Igfbp3    24484
14    Tob1   170842
15    Mitf    25094
16   Ndrg1   299923
17   Nr1h4    60351
18   Fgfr3    84489
19     PVR    25066
20     Il6    24498
21   Ptprm    29616
22   Erbb2    24337
23    Nid2   302248
24   Lamb1   298941
25    Comp    25304
26    Pls3    81748
27    Mcam    78967
28    Spp1    25353
29   Lamc1   117036
30  Col4a2   306628
31  Col4a1   290905
32    Myoc    81523
33   Anxa4    79124
34   Tfpi2   286926
35    Cst6   171096
36    Slpi    84386
37   Timp2    29543
38     Cpm   314855
39    Ggt1   116568
40    Nnmt   300691
41     Mal    25263
42  Eef1a2    24799
43     Hgd   360719
44    Tcn2    64365
45     Cda   362638
46    Pcca   687008
47    Crym   117024
48    Pdxk    83578
49    Stc1    81801
50    Wars   314442
51   Hmox1    24451
52   Fxyd2    29639
53    Rbp4    25703
54 Slc6a12    50676
55  Kdelr3   315131
56   Itm2b   290364

I think we should raise this issue with Bioconductor. IMO, case-insensitive ID mapping is a necessity.
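The two manual steps above (capitalise the first letter, then restore PVR) can be wrapped in a small helper. This is only a sketch in base R: `to_rodent_case` and its `exceptions` argument are hypothetical names, and which symbols belong in the exception list is an assumption you would need to check against the annotation package.

```r
# Convert symbols to rodent-style capitalisation (first letter upper case,
# the rest lower case), while keeping listed exceptions fully upper case.
# `to_rodent_case` and `exceptions` are hypothetical names for this sketch.
to_rodent_case <- function(symbols, exceptions = character(0)) {
  out <- paste0(toupper(substring(symbols, 1, 1)),
                tolower(substring(symbols, 2)))
  # Exceptions are matched case-insensitively and restored to upper case
  keep <- toupper(symbols) %in% toupper(exceptions)
  out[keep] <- toupper(symbols[keep])
  out
}

x2 <- to_rodent_case(c("GPX3", "IL6", "PVR", "SLC6A12"), exceptions = "PVR")
print(x2)
```

The result can then be passed to bitr in place of the hand-edited vector.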


biomaRt performs this in a case-insensitive fashion. Look at my answer.


You can just grab the appropriate bimap into R to do the mapping yourself:

library(org.Rn.eg.db)  # provides the org.Rn.egSYMBOL2EG bimap
head(as.data.frame(org.Rn.egSYMBOL2EG))

Which gives back:

  gene_id symbol
1   24152   Asip
2   24153    A2m
3   24157 Acaa1a
4   24158  Acadm
5   24159   Acly
6   24161   Acp1

This is detailed in the "bimaps" section of the AnnotationDbi vignette. Once you have a data frame, you can manipulate your matching as you see fit. Another option is to grab the SQLite connection object from the org.Rn.eg.db package and write custom SQL.
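As one way of doing the matching yourself, here is a minimal sketch of a case-insensitive lookup against such a data frame. `match_symbols_ci` is a hypothetical helper, and the toy table simply reuses a few of the rows printed above so the example runs without the annotation package.

```r
# Case-insensitive lookup of symbols in a data frame with the columns
# `gene_id` and `symbol`, as returned by as.data.frame(org.Rn.egSYMBOL2EG).
# `match_symbols_ci` is a hypothetical name for this sketch.
match_symbols_ci <- function(query, sym2eg) {
  # match() returns the first hit only; unmatched queries give NA
  idx <- match(toupper(query), toupper(sym2eg$symbol))
  data.frame(input    = query,
             SYMBOL   = sym2eg$symbol[idx],
             ENTREZID = sym2eg$gene_id[idx],
             stringsAsFactors = FALSE)
}

# Toy table built from rows shown in the output above
sym2eg <- data.frame(gene_id = c("24152", "24153", "24157"),
                     symbol  = c("Asip", "A2m", "Acaa1a"),
                     stringsAsFactors = FALSE)

res <- match_symbols_ci(c("ASIP", "acaa1a", "NOTAGENE"), sym2eg)
print(res)
```

With the package loaded, you would pass `as.data.frame(org.Rn.egSYMBOL2EG)` as the second argument. Note that if two symbols differ only by case, this sketch silently keeps the first one.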


Dear Uncle Y, thank you very much for your answer. It's my honor to meet you here. I am your big fan: )

6.8 years ago

Hi Stanley,

Please try something like this, using the biomaRt package:

library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
getBM(mart=mart, attributes=c("hgnc_symbol","entrezgene_id"),
  filters="hgnc_symbol", values=x, uniqueRows=TRUE)
   hgnc_symbol entrezgene_id
1        ANXA4        307
2          CDA        978
3       COL4A1       1282
4       COL4A2       1284
5         COMP       1311
6          CPM       1368
7        CRYAB       1410
8         CRYM       1428
9         CST6       1474
10       DEFB1       1672
11      EEF1A2       1917
12       ERBB2       2064
13       FGFR3       2261
14       FXYD2        486
15        GGT1       2678
16        GGT1     728441
17        GGT1  102724197
18        GLRX       2745
19        GPC3       2719
20        GPX3       2878
21       HCLS1       3059
22         HGD       3081
23       HMOX1       3162
24       HSPA2       3306
25      IGFBP1       3484
26      IGFBP3       3486
27         IL6       3569
28       ITM2B       9445
29      KDELR3      11015
30       LAMB1       3912
31       LAMC1       3915
32         LBP       3929
33         MAL       4118
34        MCAM       4162
35        MITF       4286
36        MYOC       4653
37       NDRG1      10397
38        NID2      22795
39        NNMT       4837
40       NR1H4       9971
41        ORM1       5004
42        PCCA       5095
43        PDXK       8566
44        PDXK  105372824
45        PLS3       5358
46       PTHLH       5744
47       PTPRM       5797
48         PVR       5817
49        RBP4       5950
50     SLC6A12       6539
51        SLPI       6590
52        SOD2       6648
53        SOD2  100129518
54        SPP1       6696
55        STC1       6781
56        TCN2       6948
57       TFPI2       7980
58       TIMP2       7077
59        TOB1      10140
60        WARS       7453

Regarding annotation conversions, there are usually one-to-many mappings, which can complicate things (see GGT1, PDXK and SOD2 above). It is also important to be aware that many genomic loci are still not well annotated, whilst at others transcripts overlap each other (having different promoters and transcription start sites).
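To see which symbols came back with more than one Entrez ID, something along these lines may help. It is a sketch in base R: the data frame `bm` is a hand-copied subset of the output above, standing in for the real getBM result, and how to resolve the duplicates is left to you.

```r
# A few rows copied from the getBM output above, standing in for the
# full result (assumption: your result has the same two columns).
bm <- data.frame(
  hgnc_symbol   = c("GGT1", "GGT1", "GGT1", "GLRX", "PDXK", "PDXK"),
  entrezgene_id = c(2678, 728441, 102724197, 2745, 8566, 105372824),
  stringsAsFactors = FALSE
)

# Number of distinct Entrez IDs returned for each symbol
n_ids <- vapply(split(bm$entrezgene_id, bm$hgnc_symbol),
                function(ids) length(unique(ids)),
                integer(1))

# Symbols with a one-to-many mapping, and the rows to review by hand
multi <- names(n_ids)[n_ids > 1]
print(multi)
print(bm[bm$hgnc_symbol %in% multi, ])
```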

Kevin


Rattus norvegicus:

library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("rnorvegicus_gene_ensembl", mart)
getBM(mart=mart, attributes=c("external_gene_name","entrezgene_id"),
  filters="external_gene_name", values=x, uniqueRows=TRUE)
   external_gene_name entrezgene_id
1                Spp1      25353
2                Glrx      64045
3                Comp      25304
4                Ggt1         NA
5               Ndrg1     299923
6                Slpi      84386
7               Lamc1     117036
8                Myoc      81523
9              Kdelr3     315131
10               Mitf      25094
11              Itm2b     290364
12               Pdxk      83578
13               Pdxk     361819
14             Igfbp3      24484
15                Il6      24498
16            Slc6a12      50676
17               Gpx3      64317
18               Pls3      81748
19               Crym     117024
20              Nr1h4      60351
21              Erbb2      24337
22                Lbp      29469
23               Gpc3         NA
24               Sod2      24787
25               Tob1     170842
26                Hgd     360719
27             Eef1a2      24799
28              Defb1      83687
29                Cpm     314855
30              Timp2      29543
31              Anxa4      79124
32                PVR      25066
33               Mcam      78967
34              Fxyd2      29639
35               Pcca     687008
36                Mal      25263
37             Igfbp1      25685
38             Col4a1     290905
39               Rbp4      25703
40               Wars     314442
41              Lamb1     298941
42               Orm1         NA
43              Tfpi2     286926
44              Cryab      25420
45              Hmox1      24451
46               Stc1      81801
47               Nnmt     300691
48              Ptprm      29616
49             Col4a2     306628
50              Fgfr3      84489
51              Hcls1     288077
52              Hspa2      60460
53               Nid2         NA
54               Cst6     171096
55                Cda  100909857
56               Tcn2      64365

Hi Kevin,

When I run the code just as you showed above, I get an error message:

Error in getBM(mart = mart, attributes = c("external_gene_name", "entrezgene"),  :
  Invalid attribute(s): entrezgene
Please use the function 'listAttributes' to get valid attribute names

Why is it saying invalid attribute?


The attribute name has changed since my post. You now need to use entrezgene_id.

I have modified my original post (above).
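If an attribute name changes again, the listAttributes function mentioned in the error message returns a data frame with a `name` column that you can search. Below is a minimal sketch: `find_attributes` is a hypothetical helper, and the toy `attrs` table only stands in for the real `listAttributes(mart)` result so the example runs offline.

```r
# Search an attribute table (as returned by biomaRt::listAttributes) for
# names matching a pattern. `find_attributes` is a hypothetical helper.
find_attributes <- function(attrs, pattern) {
  hit <- grepl(pattern, attrs$name, ignore.case = TRUE)
  attrs[hit, , drop = FALSE]
}

# Toy stand-in for listAttributes(mart); the real table has more rows
# and additional columns.
attrs <- data.frame(
  name = c("ensembl_gene_id", "external_gene_name", "entrezgene_id"),
  stringsAsFactors = FALSE
)

hits <- find_attributes(attrs, "entrez")
print(hits)
```

With a live connection, you would call `find_attributes(listAttributes(mart), "entrez")` instead.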

6.8 years ago
Mike Smith ★ 2.1k

The root of your problem is that your gene names are all upper case, but for rat (and mouse, etc.) generally only the first letter is capitalised, and the bitr conversion is case-sensitive. We can use the function str_to_title in the stringr package to fix this:

install.packages('stringr')
x2 <- stringr::str_to_title(x) 
head(x2)

Here's what they look like now:

[1] "Gpx3"  "Glrx"  "Lbp"   "Cryab" "Defb1" "Hcls1"

Then rerun your query:

eg = bitr(x2, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Rn.eg.db")

and we get a much better proportion of converted IDs:

'select()' returned 1:1 mapping between keys and columns
Warning message:
In bitr(str_to_title(x), fromType = "SYMBOL", toType = "ENTREZID",  :
  1.79% of input gene IDs are fail to map...

Looking at some of the other answers, PVR is an exception to the capitalisation rule. Since Kevin's biomaRt answer isn't case sensitive, it's perhaps the best solution assuming you have reliable internet.
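Both percentages are consistent with a list of 56 symbols: 55 of 56 failing gives 98.21%, and 1 of 56 gives 1.79%. To see exactly which symbols were dropped, you can compare the input with the SYMBOL column of the result. The sketch below uses base R only; `report_unmapped` is a hypothetical helper and the small vectors are stand-ins for `x2` and `eg$SYMBOL`.

```r
# List the input symbols missing from the mapped set and the percentage
# they represent. `report_unmapped` is a hypothetical name for this sketch.
report_unmapped <- function(input, mapped) {
  input  <- unique(input)
  failed <- setdiff(input, mapped)
  list(failed     = failed,
       pct_failed = round(100 * length(failed) / length(input), 2))
}

# Stand-ins for x2 and eg$SYMBOL
input  <- c("Gpx3", "Glrx", "Pvr", "Il6")
mapped <- c("Gpx3", "Glrx", "Il6")
r <- report_unmapped(input, mapped)
print(r)

# The figures in the two warnings, assuming 56 input symbols
print(round(100 * 55 / 56, 2))
print(round(100 * 1 / 56, 2))
```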


Actually, "PVR" is all in upper case. Is this an exception?


Yes, it looks like it is.
