In R, how to get the gene symbols for a GPL
9.9 years ago
Zhilong Jia ★ 2.2k

In R, how can I get the gene symbols for a GPL (such as GPL570)? Some GPL annotations from GEO contain no gene symbols. I have many GPLs from multiple platforms. Those without gene symbols do have GB_ACC, so more specifically: how can I map GB_ACC to gene symbols in R?

Tools like ailun can do this online, or via WSDL or Perl, but not in R or any other language I know. Are there any other tools?

GPL GEO gene-symbol • 13k views

Please add some more details and context to your question.

Thank you. I've added some details.

Thank you. So the point is: how do the authors of GEOmetadb (I noticed you are one of them) map a GPL to a bioc_package? When developers build an annotation bioc_package, I believe they know the GPL.

On the other hand, maybe only a limited number of GPLs have a bioc_package.

(Sorry, I clicked the wrong button. I just wanted to reply to @Sean.)

We map them "by hand" by looking at the Bioc packages available and mapping those to GPLs. This is a curated list (and we are happy to get help), so it is not guaranteed to be complete.

In the CellMix source code, the data directory contains a file, GPL2bioc.rda, which holds 55 entries. This seems weird to me: why don't the authors of these annotation packages (the Bioconductor package maintainers) use the GPL ID, or include it as a variable in the annotation package?

9.9 years ago

Each GPL is provided by the submitter, so there is no "standard" approach for looking up gene symbols. You'll need to examine all the GPLs and determine what column(s) could be used for lookup.
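
For platforms whose only usable column is GB_ACC, one possible route is to look the accessions up in an organism-level annotation package. The sketch below is an illustration rather than a definitive method: it assumes a human platform, that `AnnotationDbi` and `org.Hs.eg.db` are installed, and that the GB_ACC values are accessions covered by the `ACCNUM` key type. The helper names are made up for this example.

```r
# Drop a trailing version suffix (e.g. ".5") so accessions match the
# unversioned keys used by the annotation package.
strip_accession_version <- function(acc) {
  sub("\\.[0-9]+$", "", as.character(acc))
}

# Map accessions to gene symbols; accessions unknown to `orgdb` come back as NA.
# `orgdb` is assumed to be an OrgDb object such as org.Hs.eg.db::org.Hs.eg.db.
accession_to_symbol <- function(acc, orgdb) {
  acc <- strip_accession_version(acc)
  known <- AnnotationDbi::keys(orgdb, keytype = "ACCNUM")
  uniq <- unique(acc[!is.na(acc) & acc %in% known])
  if (length(uniq) == 0) {
    return(rep(NA_character_, length(acc)))
  }
  # multiVals = "first" keeps a single symbol per accession
  map <- AnnotationDbi::mapIds(orgdb, keys = uniq, column = "SYMBOL",
                               keytype = "ACCNUM", multiVals = "first")
  unname(as.character(map)[match(acc, names(map))])
}

# Usage (needs the Bioconductor packages, so shown as comments):
# gpl_table$symbol <- accession_to_symbol(gpl_table$GB_ACC,
#                                         org.Hs.eg.db::org.Hs.eg.db)
```

For non-human platforms the matching organism package would have to be swapped in, which is one more reason to inspect each GPL first.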

Just a note that Bioconductor can be really useful for these mapping tasks and several (mainly Affymetrix) array platforms have corresponding Bioconductor data packages that can be used to facilitate ID mapping. A mapping from GPL to Bioconductor annotation packages is available via the GEOmetadb Bioconductor package or at this gist:
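
As a rough sketch of the GEOmetadb route (this is not the gist mentioned above), the following looks up the `bioc_package` recorded for a GPL and then maps probe IDs through the matching `.db` package. It assumes `GEOmetadb`, `DBI`, `RSQLite` and `AnnotationDbi` are installed, that the annotation package is named `<bioc_package>.db` and exports an object of the same name, and that probe IDs use the `PROBEID` key type. The helper names are hypothetical.

```r
# Return the Bioconductor package name recorded for a GPL, or NA if none.
# `con` is a DBI connection to the GEOmetadb SQLite file.
gpl_to_bioc <- function(con, gpl_id) {
  res <- DBI::dbGetQuery(
    con,
    "SELECT bioc_package FROM gpl WHERE gpl = ?",
    params = list(gpl_id)
  )
  if (nrow(res) == 0 || is.na(res$bioc_package[1]) ||
      res$bioc_package[1] == "") {
    return(NA_character_)
  }
  res$bioc_package[1]
}

# Map probe IDs to gene symbols with the chip package,
# e.g. "hgu133plus2" -> hgu133plus2.db (naming convention assumed).
probes_to_symbols <- function(probe_ids, bioc_package) {
  pkg <- paste0(bioc_package, ".db")
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Annotation package not installed: ", pkg)
  }
  db <- getExportedValue(pkg, pkg)
  AnnotationDbi::mapIds(db, keys = probe_ids, column = "SYMBOL",
                        keytype = "PROBEID", multiVals = "first")
}

# Usage (downloads a large SQLite file, so shown as comments):
# sqlite_file <- GEOmetadb::getSQLiteFile()
# con <- DBI::dbConnect(RSQLite::SQLite(), sqlite_file)
# pkg <- gpl_to_bioc(con, "GPL570")
# if (!is.na(pkg)) probes_to_symbols("1007_s_at", pkg)
# DBI::dbDisconnect(con)
```

Because the mapping is curated by hand, an `NA` from `gpl_to_bioc()` only means no package is recorded for that GPL, not that none exists.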

Bioconductor annotation packages such as hgu133a are looked up by package name (hgu133a), not by GPLxxx. This makes it hard to achieve my goal.

I have updated my answer with further details.

Almost perfect, with only one drawback: the Bioconductor annotation is sometimes less complete than the GPL annotation (for some probe_ids, the Bioconductor annotation has no gene symbol but the GPL annotation does). I can handle this with an `if else`. Thank you.
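
A minimal sketch of that kind of fallback, in base R: prefer the Bioconductor symbol and fill in from the GPL annotation only where the Bioconductor one is missing. The function name, column names and the toy probe and gene IDs are hypothetical, and treating empty strings as missing is an assumption.

```r
# Prefer the Bioconductor symbol; fall back to the GPL symbol when the
# Bioconductor one is NA or empty. Empty results are normalised to NA.
fill_missing_symbols <- function(bioc_symbol, gpl_symbol) {
  stopifnot(length(bioc_symbol) == length(gpl_symbol))
  bioc_symbol <- as.character(bioc_symbol)
  gpl_symbol <- as.character(gpl_symbol)
  missing_bioc <- is.na(bioc_symbol) | bioc_symbol == ""
  out <- ifelse(missing_bioc, gpl_symbol, bioc_symbol)
  out[!is.na(out) & out == ""] <- NA_character_
  out
}

# Toy data with made-up IDs
probes <- data.frame(
  probe_id = c("p1", "p2", "p3"),
  bioc     = c("GENE1", NA, NA),
  gpl      = c("GENE1", "GENE2", ""),
  stringsAsFactors = FALSE
)
probes$symbol <- fill_missing_symbols(probes$bioc, probes$gpl)
print(probes)
```

Here `p2` takes its symbol from the GPL column, and `p3` stays `NA` because neither source has one.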

These IDs are not "TRUTH". The annotation, unfortunately, changes with time and with who performs the ID mapping and with what resources. In some cases, the Bioconductor mapping may be more complete than the GPL and sometimes the opposite. Bioconductor resources are updated every six months. The GPLs are, in general, never updated.

I found that GEO updated the GPL annotation on Sep 25, 2014 (e.g. GPL570.annot.gz). GEOquery::getGEO has an AnnotGPL parameter: if it is TRUE, the function uses information from a file like GPL570.annot.gz; if it is FALSE (the default), it uses an annotation dated Jun 9, 2011. However, if I use hgu133plus2SYMBOL, a probe_id is dropped if it maps to more than one gene. It seems Bioconductor applies a stricter (or, one might say, more correct) rule when building the annotation package.
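
To make the comparison concrete, here is a small sketch that loads a platform table with or without `AnnotGPL` and counts how many genes each probe is annotated with. `GEOquery::getGEO()` and `GEOquery::Table()` are the real entry points; the helper names, the `///` separator between multiple symbols, and the symbol column name in the usage comments are assumptions to check against the actual table.

```r
# Fetch a platform annotation table from GEO (requires GEOquery and network).
load_gpl_table <- function(gpl_id, annot = TRUE) {
  gpl <- GEOquery::getGEO(gpl_id, AnnotGPL = annot)
  GEOquery::Table(gpl)
}

# Count the genes listed for each probe, assuming multiple symbols are
# separated by "///" (with or without surrounding spaces).
count_genes <- function(symbols, sep = "\\s*///\\s*") {
  symbols <- as.character(symbols)
  n <- lengths(strsplit(symbols, sep))
  n[is.na(symbols) | symbols == ""] <- 0L
  n
}

# Usage (network access needed, so shown as comments):
# tbl <- load_gpl_table("GPL570", annot = TRUE)
# colnames(tbl)                           # find the gene-symbol column first
# n <- count_genes(tbl[["Gene symbol"]])  # column name assumed
# head(tbl[n > 1, ])                      # probes annotated with several genes
```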

[Figure not available: 1007_s_at mapped to multiple genes]

[Figure not available: mapping of the top 30 probes]

So the questions here are:

  1. Why is AnnotGPL set to FALSE by default? If no such annotation file exists, the function could just fall back to the old one.
  2. Why does GEO not drop probe_ids that map to more than one gene/miRNA?

A related question is here.

GEO is a public repository. I don't think they should be in the business of removing data from datasets. You are, of course, free to do so. Unfortunately, I think you'll find that even after nearly 2 decades of microarrays, there are no "correct" answers as to how to deal with annotation.

Biologically, I believe the method used by hgu133plus2SYMBOL is better when a probe maps to more than one gene, even if we cannot say it's correct.

You are right. I made a mistake: more is not always better. I will use the annotation from Bioconductor as you recommend. Thank you.

There are 74 platformMap entries in total, versus 13896 GPLs in GEO.

While the platform mapping covers less than 1% of the platforms, it covers a very significant proportion of the samples available. The more relevant number would be the number of samples in one of the platforms available from Bioconductor. I won't do that for the comment, but the GEOmetadb package can answer that question.
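
One way to get that number, sketched under assumptions: with a connection to the GEOmetadb SQLite file, count the samples (`gsm` table) whose platform has a non-empty `bioc_package` in the `gpl` table. The function name is hypothetical, and treating both NULL and empty strings as "no package" is a guess about how missing values are stored.

```r
# Count samples overall and on platforms that have a Bioconductor package.
# `con` is a DBI connection to the GEOmetadb SQLite file.
sample_coverage <- function(con) {
  total <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM gsm")$n
  covered <- DBI::dbGetQuery(con, "
    SELECT COUNT(*) AS n
    FROM gsm
    JOIN gpl ON gsm.gpl = gpl.gpl
    WHERE gpl.bioc_package IS NOT NULL AND gpl.bioc_package <> ''
  ")$n
  list(total = total, covered = covered, fraction = covered / total)
}

# Usage (needs the downloaded GEOmetadb file, so shown as comments):
# con <- DBI::dbConnect(RSQLite::SQLite(), GEOmetadb::getSQLiteFile())
# sample_coverage(con)
# DBI::dbDisconnect(con)
```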

CellMix::gpl2bioc("GPL96") can map a GPL to its Bioconductor annotation package. But maybe only a limited number of GPLs have a bioc_package; some calls return NA.
