In R, how to get gene symbols for a GPL
10.2 years ago
Zhilong Jia ★ 2.2k

In R, how can I get the gene symbols for a GPL (such as GPL570)? For some GPLs from GEO, the annotation contains no gene symbols. I have many GPLs from multiple platforms. Those without gene symbols do have GB_ACC, so more specifically: how can I map GB_ACC to gene symbols in R?

Tools like AILUN can do this online, through WSDL, or in Perl, but not in R or any other language I know well. Are there any other tools?

GPL GEO gene-symbol • 14k views

Please add some more details and context to your question.


Thank you. I've added some details.


Thank you. So the question is: how do the authors of GEOmetadb (I noticed you are one of them) map a GPL to its bioc_package? I believe that developers who build an annotation bioc_package know the corresponding GPL.

On the other hand, maybe only a limited number of bioc_packages exist for the GPLs.

(Sorry, I clicked the wrong button. I just wanted to reply to @Sean.)


We map them "by hand" by looking at the Bioc packages available and mapping those to GPLs. This is a curated list (and we are happy to get help), so it is not guaranteed to be complete.


In the CellMix source code, the data directory contains a file, GPL2bioc.rda, which holds 55 mappings. This seems weird to me: why don't the authors of these annotation packages (the Bioconductor package maintainers) use the GPL ID, or include it as a variable in the annotation package?

10.2 years ago

Each GPL is provided by the submitter, so there is no "standard" approach for looking up gene symbols. You'll need to examine all the GPLs and determine what column(s) could be used for lookup.
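
As a rough sketch of that inspection step, the code below picks a lookup column from a GPL table (for example the data frame returned by `GEOquery::Table()` on a downloaded GPL) and, for the GB_ACC case, strips the version suffix before querying an organism package. The helper names and the list of candidate column names are assumptions for illustration, not something GEO guarantees, and the lookup assumes a human platform with `org.Hs.eg.db` installed.

```r
# Candidate column names are assumptions; extend them after inspecting
# your own GPL tables. Symbol columns are preferred over accessions.
symbol_cols    <- c("Gene Symbol", "Gene symbol", "GENE_SYMBOL", "SYMBOL")
accession_cols <- c("GB_ACC")

# Return the first usable lookup column, or NA if none is present.
pick_lookup_column <- function(gpl_table) {
  hit <- intersect(c(symbol_cols, accession_cols), colnames(gpl_table))
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Drop a trailing version suffix, e.g. "NM_000546.5" -> "NM_000546".
strip_version <- function(acc) sub("\\.[0-9]+$", "", acc)

# Map GenBank/RefSeq accessions to gene symbols (human only in this sketch).
# Returns all NA when the Bioconductor packages are not installed.
acc_to_symbol <- function(acc) {
  acc <- strip_version(as.character(acc))
  out <- rep(NA_character_, length(acc))
  have_pkgs <- requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)
  keys <- unique(acc[!is.na(acc)])
  if (!have_pkgs || length(keys) == 0) return(out)
  res <- tryCatch(
    AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys = keys,
                          column = "SYMBOL", keytype = "ACCNUM",
                          multiVals = "first"),
    error = function(e) NULL  # mapIds() errors when no key is recognised
  )
  if (!is.null(res)) out <- unname(as.character(res[acc]))
  out
}
```

For non-human platforms the same idea should work with the matching organism package in place of `org.Hs.eg.db`.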

Just a note that Bioconductor can be really useful for these mapping tasks and several (mainly Affymetrix) array platforms have corresponding Bioconductor data packages that can be used to facilitate ID mapping. A mapping from GPL to Bioconductor annotation packages is available via the GEOmetadb Bioconductor package or at this gist:

"title" "gpl" "bioc_package" "manufacturer" "organism" "data_row_count"
"Illumina Sentrix Array Matrix (SAM) - GoldenGate Methylation Cancer Panel I" "GPL15380" "GGHumanMethCancerPanelv1" "Illumina" "Homo sapiens" 1536
"Illumina HumanMethylation27 BeadChip (HumanMethylation27_270596_v.1.2)" "GPL8490" "IlluminaHumanMethylation27k" "Illumina, Inc." "Homo sapiens" 27578
"Illumina HumanMethylation450 BeadChip (HumanMethylation450_15017482)" "GPL13534" "IlluminaHumanMethylation450k" "Illumina, Inc." "Homo sapiens" 485577
"GE Healthcare/Amersham Biosciences CodeLink™ ADME Rat 16-Assay Bioarray" "GPL2898" "adme16cod" "GE Healthcare" "Rattus norvegicus" 1280
"[AG] Affymetrix Arabidopsis Genome Array" "GPL71" "ag" "Affymetrix" "Arabidopsis thaliana" 8297
"[ATH1-121501] Affymetrix Arabidopsis ATH1 Genome Array" "GPL198" "ath1121501" "Affymetrix" "Arabidopsis thaliana" 22810
"[Bovine] Affymetrix Bovine Genome Array" "GPL2112" "bovine" "Affymetrix" "Bos taurus" 24128
"[Canine] Affymetrix Canine Genome 1.0 Array" "GPL3979" "canine" "Affymetrix" "Canis lupus familiaris" 23913
"[Canine_2] Affymetrix Canine Genome 2.0 Array" "GPL3738" "canine2" "Affymetrix" "Canis lupus familiaris" 43035
"[Celegans] Affymetrix C. elegans Genome Array" "GPL200" "celegans" "Affymetrix" "Caenorhabditis elegans" 22625
"[Chicken] Affymetrix Chicken Genome Array" "GPL3213" "chicken" "Affymetrix" "Gallus gallus" 38535
"[DrosGenome1] Affymetrix Drosophila Genome Array" "GPL72" "drosgenome1" "Affymetrix" "Drosophila melanogaster" 14010
"[Drosophila_2] Affymetrix Drosophila Genome 2.0 Array" "GPL1322" "drosophila2" "Affymetrix" "Drosophila melanogaster" 18952
"[Ecoli_ASv2] Affymetrix E. coli Antisense Genome Array" "GPL199" "ecoli2" "Affymetrix" "Escherichia coli K-12" 7312
"CodeLink UniSet Human I Bioarray" "GPL4191" "h10kcod" "GE Healthcare" "Homo sapiens" 10458
"GE Healthcare/Amersham Biosciences CodeLink™ UniSet Human 20K I Bioarray" "GPL2891" "h20kcod" "GE Healthcare" "Homo sapiens" 23572
"[HC_G110] Affymetrix Human Cancer Array" "GPL74" "hcg110" "Affymetrix" "Homo sapiens" 2059
"[HG-Focus] Affymetrix Human HG-Focus Target Array" "GPL201" "hgfocus" "Affymetrix" "Homo sapiens" 8793
"[HG-U133A] Affymetrix Human Genome U133A Array" "GPL96" "hgu133a" "Affymetrix" "Homo sapiens" 22283
"[HG-U133A_2] Affymetrix Human Genome U133A 2.0 Array" "GPL571" "hgu133a2" "Affymetrix" "Homo sapiens" 22277
"[HG-U133B] Affymetrix Human Genome U133B Array" "GPL97" "hgu133b" "Affymetrix" "Homo sapiens" 22645
"[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array" "GPL570" "hgu133plus2" "Affymetrix" "Homo sapiens" 54675
"[HG-U219] Affymetrix Human Genome U219 Array" "GPL13667" "hgu219" "Affymetrix" "Homo sapiens" 49386
"[HG_U95A] Affymetrix Human Genome U95A Array" "GPL91" "hgu95av2" "Affymetrix" "Homo sapiens" 12626
"[HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array" "GPL8300" "hgu95av2" "Affymetrix" "Homo sapiens" 12625
"[HG_U95B] Affymetrix Human Genome U95B Array" "GPL92" "hgu95b" "Affymetrix" "Homo sapiens" 12620
"[HG_U95C] Affymetrix Human Genome U95C Array" "GPL93" "hgu95c" "Affymetrix" "Homo sapiens" 12646
"[HG_U95D] Affymetrix Human Genome U95D Array" "GPL94" "hgu95d" "Affymetrix" "Homo sapiens" 12644
"[HG_U95E] Affymetrix Human Genome U95E Array" "GPL95" "hgu95e" "Affymetrix" "Homo sapiens" 12639
"Agilent Human 1 cDNA Microarray (G4100A) [layout C]" "GPL5689" "hgug4100a" "Agilent Technologies" "Homo sapiens" 16281
"Agilent-012097 Human 1A Microarray (V2) G4110B (Feature Number version)" "GPL887" "hgug4110b" "Agilent Technologies" "Homo sapiens" 22575
"Agilent-011871 Human 1B Microarray G4111A (Feature Number version)" "GPL886" "hgug4111a" "Agilent Technologies" "Homo sapiens" 22575
"Agilent-012391 Whole Human Genome Oligo Microarray G4112A (Feature Number version)" "GPL1708" "hgug4112a" "Agilent Technologies" "Homo sapiens" 44290
"[HT_HG-U133A] Affymetrix Human Genome U133A Array (custom CDF: HTHGU133A_Hs_ENTREZG.cdf version 17.0.0)" "GPL17897" "hthgu133a" "Affymetrix" "Homo sapiens" 12092
"[HT_HG-U133B] Affymetrix HT Human Genome U133B Array [custom CDF: ENTREZ brainarray v. 14]" "GPL15396" "hthgu133b" "Affymetrix" "Homo sapiens" 7906
"[Hu35KsubA] Affymetrix Human 35K SubA Array" "GPL98" "hu35ksuba" "Affymetrix" "Homo sapiens" 8934
"[Hu35KsubB] Affymetrix Human 35K SubB Array" "GPL99" "hu35ksubb" "Affymetrix" "Homo sapiens" 8924
"[Hu35KsubC] Affymetrix Human 35K SubC Array" "GPL100" "hu35ksubc" "Affymetrix" "Homo sapiens" 8928
"[Hu35KsubD] Affymetrix Human 35K SubD Array" "GPL101" "hu35ksubd" "Affymetrix" "Homo sapiens" 8928
"[Hu6800] Affymetrix Human Full Length HuGeneFL Array" "GPL80" "hu6800" "Affymetrix" "Homo sapiens" 7129
"[HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]" "GPL6244" "hugene10sttranscriptcluster" "Affymetrix" "Homo sapiens" 33297
"[HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array [transcript (gene) version]" "GPL11532" "hugene11sttranscriptcluster" "Affymetrix" "Homo sapiens" 33297
"Illumina human-6 v1.0 expression beadchip" "GPL6097" "illuminaHumanv1" "Illumina Inc." "Homo sapiens" 47296
"Illumina human-6 v2.0 expression beadchip" "GPL6102" "illuminaHumanv2" "Illumina Inc." "Homo sapiens" 48702
"Illumina HumanHT-12 V3.0 expression beadchip" "GPL6947" "illuminaHumanv3" "Illumina Inc." "Homo sapiens" 49576
"Illumina HumanHT-12 V4.0 expression beadchip" "GPL10558" "illuminaHumanv4" "Illumina Inc." "Homo sapiens" 47323
"[MG_U74A] Affymetrix Murine Genome U74A Array" "GPL32" "mgu74a" "Affymetrix" "Mus musculus" 12654
"[MG_U74Av2] Affymetrix Murine Genome U74A Version 2 Array" "GPL81" "mgu74av2" "Affymetrix" "Mus musculus" 12488
"[MG_U74B] Affymetrix Murine Genome U74B Array" "GPL33" "mgu74b" "Affymetrix" "Mus musculus" 12636
"[MG_U74Bv2] Affymetrix Murine Genome U74B Version 2 Array" "GPL82" "mgu74bv2" "Affymetrix" "Mus musculus" 12477
"[MG_U74C] Affymetrix Murine Genome U74C Array" "GPL34" "mgu74c" "Affymetrix" "Mus musculus" 12728
"[MG_U74Cv2] Affymetrix Murine Genome U74 Version 2 Array" "GPL83" "mgu74cv2" "Affymetrix" "Mus musculus" 11934
"[Maize] Affymetrix Maize Genome Array" "GPL4032" "moe430a" "Affymetrix" "Zea mays" 17734
"[MOE430A] Affymetrix Mouse Expression 430A Array" "GPL339" "moe430b" "Affymetrix" "Mus musculus" 22690
"[MOE430B] Affymetrix Mouse Expression 430B Array" "GPL340" "mouse4302" "Affymetrix" "Mus musculus" 22575
"[Mouse430_2] Affymetrix Mouse Genome 430 2.0 Array" "GPL1261" "mouse430a2" "Affymetrix" "Mus musculus" 45101
"[Mu11KsubA] Affymetrix Murine 11K SubA Array" "GPL75" "mu11ksuba" "Affymetrix" "Mus musculus" 6584
"[Mu11KsubB] Affymetrix Murine 11K SubB Array" "GPL76" "mu11ksubb" "Affymetrix" "Mus musculus" 6595
"[Mu19KsubA] Affymetrix Murine 19K SubA Array" "GPL77" "mu19ksuba" "Affymetrix" "Mus musculus" 7045
"[Mu19KsubB] Affymetrix Murine 19K SubB Array" "GPL78" "mu19ksubb" "Affymetrix" "Mus musculus" 7054
"[Mu19KsubC] Affymetrix Murine 19K SubC Array" "GPL79" "mu19ksubc" "Affymetrix" "Mus musculus" 7047
"[RAE230A] Affymetrix Rat Expression 230A Array" "GPL341" "rae230a" "Affymetrix" "Rattus norvegicus" 15923
"[RAE230B] Affymetrix Rat Expression 230B Array" "GPL342" "rae230b" "Affymetrix" "Rattus norvegicus" 15333
"[Rat230_2] Affymetrix Rat Genome 230 2.0 Array" "GPL1355" "rat2302" "Affymetrix" "Rattus norvegicus" 31099
"[RG_U34A] Affymetrix Rat Genome U34 Array" "GPL85" "rgu34a" "Affymetrix" "Rattus norvegicus" 8799
"[RG_U34B] Affymetrix Rat Genome U34 Array" "GPL86" "rgu34b" "Affymetrix" "Rattus norvegicus" 8791
"[RG_U34C] Affymetrix Rat Genome U34 Array" "GPL87" "rgu34c" "Affymetrix" "Rattus norvegicus" 8789
"[RN_U34] Affymetrix Rat Neurobiology U34 Array" "GPL88" "rnu34" "Affymetrix" "Rattus norvegicus" 1322
"[RT_U34] Affymetrix Rat Toxicology U34 Array" "GPL89" "rtu34" "Affymetrix" "Rattus norvegicus" 1031
"[U133_X3P] Affymetrix Human X3P Array" "GPL1352" "u133x3p" "Affymetrix" "Homo sapiens" 61359
"[Xenopus_laevis] Affymetrix Xenopus laevis Genome Array" "GPL1318" "xenopuslaevis" "Affymetrix" "Xenopus laevis" 15611
"[Yeast_2] Affymetrix Yeast Genome 2.0 Array" "GPL2529" "yeast2" "Affymetrix" "Schizosaccharomyces pombe; Saccharomyces cerevisiae" 10928
"[YG_S98] Affymetrix Yeast Genome S98 Array" "GPL90" "ygs98" "Affymetrix" "Saccharomyces cerevisiae" 9335
"[Zebrafish] Affymetrix Zebrafish Genome Array" "GPL1319" "zebrafish" "Affymetrix" "Danio rerio" 15617
Source: platformMap.txt (GitHub gist)


Bioconductor annotation packages such as hgu133a are identified by the package name (hgu133a), not by GPLxxx. This makes it hard to achieve my goal.
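
One way to bridge the two naming schemes is a small lookup table keyed by GPL ID. The sketch below copies three rows from the mapping table in the answer. The helper names `gpl_to_bioc` and `probe_to_symbol` are made up for illustration, and appending `.db` to get the installable package name is an assumption that holds for chip packages such as hgu133plus2.db but may not hold for every row.

```r
# Three rows copied from the GPL-to-Bioconductor mapping table above.
platform_map <- read.table(text = '
"gpl" "bioc_package"
"GPL96" "hgu133a"
"GPL570" "hgu133plus2"
"GPL6244" "hugene10sttranscriptcluster"
', header = TRUE, quote = "\"", stringsAsFactors = FALSE)

# Look up the annotation package name for one or more GPL IDs.
# Unmapped GPLs give NA.
gpl_to_bioc <- function(gpl, map = platform_map) {
  map$bioc_package[match(gpl, map$gpl)]
}

# Map probe IDs to gene symbols through the chip package, if installed.
# Returns NULL when the GPL is unmapped or the package is missing.
probe_to_symbol <- function(probes, gpl) {
  pkg <- gpl_to_bioc(gpl)
  if (is.na(pkg)) return(NULL)
  pkg <- paste0(pkg, ".db")  # assumed naming convention
  if (!requireNamespace("AnnotationDbi", quietly = TRUE) ||
      !requireNamespace(pkg, quietly = TRUE)) return(NULL)
  db <- getExportedValue(pkg, pkg)  # e.g. the hgu133plus2.db object
  AnnotationDbi::mapIds(db, keys = probes, column = "SYMBOL",
                        keytype = "PROBEID", multiVals = "first")
}
```

With the chip package installed, `probe_to_symbol("1007_s_at", "GPL570")` would query hgu133plus2.db. Without it, the function returns NULL rather than failing.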


I have updated my answer with further details.


Almost perfect, with one drawback: the Bioconductor annotation is sometimes less complete than the GPL annotation (for some probe_ids, the Bioconductor annotation has no gene symbol but the GPL annotation does). I can handle this with an `if else`. Thank you.
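
A minimal sketch of such a fallback, assuming two character vectors of symbols aligned on the same probe_ids (one from the Bioconductor package, one from the GPL table). The function names and the toy values are hypothetical.

```r
# Treat empty strings as missing so both sources use NA consistently.
blank_to_na <- function(x) {
  x <- as.character(x)
  x[!is.na(x) & x == ""] <- NA_character_
  x
}

# Prefer `primary`; where it is missing, fall back to `secondary`.
coalesce_symbols <- function(primary, secondary) {
  stopifnot(length(primary) == length(secondary))
  primary   <- blank_to_na(primary)
  secondary <- blank_to_na(secondary)
  ifelse(is.na(primary), secondary, primary)
}

# Toy values for illustration only.
bioc_symbol <- c("GENE_A", NA, "")
gpl_symbol  <- c("GENE_A", "GENE_B", NA)
merged <- coalesce_symbols(bioc_symbol, gpl_symbol)
```

Swapping the two arguments gives the opposite preference, which may be worth comparing, since neither source is guaranteed to be the more complete one.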


These IDs are not "TRUTH". The annotation, unfortunately, changes with time and with who performs the ID mapping and with what resources. In some cases, the Bioconductor mapping may be more complete than the GPL and sometimes the opposite. Bioconductor resources are updated every six months. The GPLs are, in general, never updated.


I found that GEO updated the GPL annotation on Sep 25, 2014 (e.g. GPL570.annot.gz). GEOquery::getGEO has an AnnotGPL parameter: if it is TRUE, the function uses information from a file like GPL570.annot.gz, whereas if it is FALSE (the default), it uses an annotation dated Jun 9, 2011. However, hgu133plus2SYMBOL drops a probe_id that maps to more than one gene. It seems Bioconductor applies a stricter (or arguably more correct) rule when building the annotation package.

[Figure: 1007_s_at mapped to multiple genes (image not available)]

[Figure: mapping of the top 30 probes (image not available)]

So the questions are:

  1. Why is AnnotGPL FALSE by default? If no such annotation file exists, the function could simply fall back to the old one.
  2. Why does GEO not drop probe_ids that map to more than one gene/miRNA?

A related question is here.
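
For anyone who wants the stricter behaviour on top of a GPL annotation, a small sketch follows. It assumes the GPL symbol column joins multiple genes with `///` (with or without surrounding spaces), which should be checked against the actual table. The function names and the example strings are illustrative.

```r
# Split a GPL-style symbol field into its individual symbols.
# The "///" separator is an assumption about the GPL table format.
split_symbols <- function(x, sep = "\\s*///\\s*") {
  parts <- strsplit(as.character(x), sep, perl = TRUE)
  lapply(parts, function(s) {
    s <- trimws(s)
    s[!is.na(s) & nzchar(s)]
  })
}

# Keep a symbol only when the probe maps to exactly one gene;
# multi-mapped or unannotated probes become NA.
unique_symbol <- function(x, sep = "\\s*///\\s*") {
  vapply(split_symbols(x, sep),
         function(s) if (length(s) == 1) s else NA_character_,
         character(1))
}

# Illustrative input: one multi-mapped probe, one single, two missing.
gpl_field <- c("DDR1 /// MIR4640", "RFC2", NA, "")
strict <- unique_symbol(gpl_field)
```

Keeping `split_symbols()` separate makes it easy to inspect which probes are multi-mapped before deciding whether to drop them.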


GEO is a public repository. I don't think they should be in the business of removing data from datasets. You are, of course, free to do so. Unfortunately, I think you'll find that even after nearly 2 decades of microarrays, there are no "correct" answers as to how to deal with annotation.


Biologically, I believe the approach used by hgu133plus2SYMBOL is better when a probe maps to more than one gene, even if we cannot say it is correct.


You are right. I made a mistake: more is not necessarily better. I will use the Bioconductor annotation as you recommend. Thank you.


platformMap has 74 entries in total, while GEO has 13896 GPLs.


While the platform mapping covers less than 1% of the platforms, it covers a very significant proportion of the samples available. The more relevant number would be the number of samples in one of the platforms available from Bioconductor. I won't do that for the comment, but the GEOmetadb package can answer that question.


CellMix::gpl2bioc("GPL96") can map a GPL to its Bioconductor annotation package. But maybe only a limited number of bioc_packages exist for the GPLs; some calls return NA.
