Finding Gene Symbol Synonyms
4
20
Entering edit mode
14.5 years ago
Mike Dewar ★ 1.6k

Some HGNC Gene Symbols have synonyms that are more familiar to biologists of particular breeds. For example, "SELL" means little to an immunologist, whereas SELL's alias, "CD62L", means rather a lot. Showing the biologist a list of gene names and asking "do any of these ring bells?" seems to be an important part of the process of selecting important genes (or, rather, of validating the selection), and hence I'd like to make sure they see the gene names that make sense to them.

My question, therefore, is: does anyone know a simple method to retrieve gene synonyms? I don't want to do any enrichment, or clustering or normalisation, I just need a mapping from HGNC symbol -> synonyms. I can't quite figure out how to persuade biomart to do this.

In addition, are there certain sets of symbols that are preferred by some communities? For example, do I need to search through all the synonymous symbols, or can I just ask biomart (or something) to return a particular set of gene symbols?

annotation • 25k views
ADD COMMENT
20
Entering edit mode
14.5 years ago
Andrew Su 4.9k

If you wanted to do this analysis for a large number of gene symbols and/or from the command line, I would first download gene_info.gz from the NCBI Gene FTP site, and then use awk to parse it. For example, SELL has the Entrez Gene ID 6402, so:

gzip -cd gene_info.gz | awk '$2==6402{print $5}'

produces this output:

CD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1

(The second column of gene_info is Entrez Gene ID, the fifth column has the aliases)

You can also do a similar awk parsing based on the gene symbol directly, but then you probably also want to limit it by organism (e.g., human=9606). For example:

gzip -cd gene_info.gz | awk '$3=="SELL"&&$1==9606{print $5}'

produces the same output as above...

To get a file that translates all human gene symbols to their aliases:

gzip -cd gene_info.gz | awk '$1==9606{print $3"\t"$5}' > output.txt
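
For anyone who prefers Python to awk, the same mapping can be built as a dictionary. This is only a minimal sketch, not a tested pipeline: the function name `read_synonyms` is illustrative, the two extra sample rows are made up, and it assumes the column order described above (tax ID, Gene ID, symbol, locus tag, pipe-separated aliases) with `-` marking an empty alias field.

```python
import csv
import io


def read_synonyms(lines, tax_id="9606"):
    """Map each gene symbol to its list of aliases for one organism."""
    synonyms = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, the '#'-prefixed header, and rows that are too short.
        if len(row) < 5 or row[0].startswith("#"):
            continue
        if row[0] != tax_id:
            continue
        symbol, aliases = row[2], row[4]
        # Assumption: "-" means the gene has no recorded aliases.
        synonyms[symbol] = [] if aliases == "-" else aliases.split("|")
    return synonyms


SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t6402\tSELL\t-\tCD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1\n"
    "9606\t0\tNOALIAS\t-\t-\n"     # made-up row: a gene without aliases
    "10090\t0\tSell\t-\tMADEUP\n"  # made-up non-human row, filtered out
)

human = read_synonyms(io.StringIO(SAMPLE))
print(human["SELL"])
```

To run it on the real file, pass `gzip.open("gene_info.gz", "rt")` in place of the `io.StringIO` object.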
ADD COMMENT
4
Entering edit mode

Four years since you've posted this, I've just found it. Exactly what I was looking for, thanks. Similarly to the initial poster, I am just interested in H. sapiens genes. This means you don't need to download the rather large full list from Entrez but can limit yourself to:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz

ADD REPLY
0
Entering edit mode

Very useful, thanks!

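One clarification: if you also want the official symbol ($11), set the -F option to "\t", because fields after the fifth (such as the description) can contain spaces and would otherwise shift the column numbers. For example:

gzip -cd gene_info.gz | awk -F "\t" '$1==9606&&$3=="SELL"{print $3"\t"$5"\t"$11}'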

## $3 = Symbol
## $11 = Symbol_from_nomenclature_authority
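
Since the original question is about recognising a gene by whichever name a biologist knows, it can also help to invert the mapping, so that an alias such as CD62L leads back to the official symbol. A rough sketch under assumptions: `alias_index` is a made-up helper name, rows are split on tabs as in the awk command above, and the sample row is a shortened stand-in with `-` placeholders in the columns that are not used.

```python
from collections import defaultdict


def alias_index(lines, tax_id="9606"):
    """Map every symbol and alias (upper-cased) to the official symbols using it."""
    index = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 11 or fields[0] != tax_id:
            continue
        symbol, aliases, official = fields[2], fields[4], fields[10]
        # Fall back to column 3 when no nomenclature-authority symbol is given.
        target = symbol if official == "-" else official
        names = [symbol] + ([] if aliases == "-" else aliases.split("|"))
        for name in names:
            index[name.upper()].add(target)
    return index


# One row shaped like gene_info: 11 tab-separated columns, unused ones set to "-".
ROW = "\t".join(["9606", "6402", "SELL", "-", "CD62L|LAM1|LEU8",
                 "-", "-", "-", "-", "-", "SELL"])

index = alias_index([ROW])
print(sorted(index["CD62L"]))
```

An alias can belong to more than one gene, which is why each entry is a set rather than a single symbol.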
ADD REPLY
0
Entering edit mode

I have an R wrapper for this at https://github.com/oganm/geneSynonym. It extracts synonym information about the species of interest and allows you to query any gene symbol for synonyms.

ADD REPLY
7
Entering edit mode
14.5 years ago

GeneALaCart from GeneCards would be a good start. There are definitely other resources that can do this type of mapping, but in my ID-mapping experience GeneCards provides a good number of aliases and descriptions for human genes.

A quick search using GeneALaCart got the following aliases for CD62L

Copied from the output CSV file :

Gene Symbol : SELL 
Entrez_Gene ID : 5579    
HGNC_ID : 9395
Aliases : LEU8 |LAM1 |LECAM1 |hLHRc |Leu-8 |TQ1 |LAM-1 |LSEL |PLNHR |LNHR |CD62L |gp90-MEL |LYAM1 |Lyam-1
ADD COMMENT
0
Entering edit mode

It is strange that the Entrez_Gene ID in your example is 5579, because that is the Entrez Gene ID of PRKCB (protein kinase C beta). For SELL it should be 6402.

ADD REPLY
6
Entering edit mode
14.5 years ago

In your case, maybe the easiest way would be to use the HGNC output data webpage.

You can easily check the fields of your choice like:

  • Approved Symbol
  • Aliases
  • Entrez Gene ID

Then also check

  • Select Status Approved
  • Select all Chromosomes

Then press submit to get the listing as a text file that you can either open in Excel or load into an SQL database.

In the case of the SELL gene reported previously, you get:

SELL#LSEL, LAM1, LAM-1, hLHRc, Leu-8, Lyam-1, PLNHR, CD62L#6402
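
If the listing is going to be processed by a script rather than in Excel, a line like the one above can be split into its three fields. This is only a sketch: it assumes `#` separates the fields and a comma separates the aliases, exactly as printed above. The actual download may use a different delimiter (a tab, for instance), in which case the `sep` argument would need adjusting. `parse_hgnc_line` is an illustrative name.

```python
def parse_hgnc_line(line, sep="#"):
    """Split 'symbol<sep>alias, alias, ...<sep>entrez_id' into its parts."""
    symbol, aliases, entrez_id = line.strip().split(sep)
    # Drop surrounding spaces and ignore empty entries.
    alias_list = [a.strip() for a in aliases.split(",") if a.strip()]
    return symbol, alias_list, entrez_id


line = "SELL#LSEL, LAM1, LAM-1, hLHRc, Leu-8, Lyam-1, PLNHR, CD62L#6402"
symbol, aliases, entrez_id = parse_hgnc_line(line)
print(symbol, entrez_id, aliases)
```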
ADD COMMENT
1
Entering edit mode

Fred - IMHO stands for "in my honest opinion"! I think you have a fan, rather than a competitor...

ADD REPLY
1
Entering edit mode

I find French online acronyms to be very difficult also, though typically more fun! I went with the awk-based answer above as it involves less clicking, though I think your answer will be very helpful to others coming across this question...

ADD REPLY
1
Entering edit mode

IMHO, this has been my fault. In the future I'll try to be more precise and academic ;)

ADD REPLY
0
Entering edit mode

IMHO this is, by far, the easiest way of retrieving such data

ADD REPLY
0
Entering edit mode

@Jorge. You are very welcome to click on the "Add Another Answers" button and demonstrate how to get the listing through IMHO. I think everybody here is eager to learn new methods, so do not hesitate to share yours.

ADD REPLY
0
Entering edit mode

@Jorge. Actually I didn't know this English acronym. I thought it was a bioinformatics server. And don't worry, I will never consider you or others as competitors. I am here to learn new solutions. I am glad you like my solution. Feel free to click the "Click to set this answer as your accepted answer" button ;-)

ADD REPLY
4
Entering edit mode
14.5 years ago

Unfortunately, CD62L is not present in the UCSC DB; however, here is a query for another gene (PRKCB1) listing the position and the aliases.

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg18 -e '
select distinct
 K.chrom,
 K.txStart,
 K.txEnd,
 A1.alias,
 A2.alias
from
 knownGene as K,
 kgAlias as A1,
 kgAlias as A2
where
 K.name=A1.kgID and
 K.name=A2.kgID and
 A1.alias<A2.alias and
 (A1.alias="PRKCB1" or A2.alias="PRKCB1") '

result:

+-------+----------+----------+------------+------------+
| chrom | txStart  | txEnd    | alias      | alias      |
+-------+----------+----------+------------+------------+
| chr16 | 23754822 | 24134810 | NM_002738  | PRKCB1     |
| chr16 | 23754822 | 24134810 | NP_002729  | PRKCB1     |
| chr16 | 23754822 | 24134810 | P05771-2   | PRKCB1     |
| chr16 | 23754822 | 24134810 | PKCB       | PRKCB1     |
| chr16 | 23754822 | 24134810 | PRKCB      | PRKCB1     |
| chr16 | 23754822 | 24134810 | PRKCB1     | uc002dmc.1 |
| chr16 | 23754822 | 24139063 | KPCB_HUMAN | PRKCB1     |
| chr16 | 23754822 | 24139063 | NM_212535  | PRKCB1     |
| chr16 | 23754822 | 24139063 | NP_997700  | PRKCB1     |
| chr16 | 23754822 | 24139063 | O43744     | PRKCB1     |
| chr16 | 23754822 | 24139063 | P05127     | PRKCB1     |
| chr16 | 23754822 | 24139063 | P05771     | PRKCB1     |
| chr16 | 23754822 | 24139063 | PKCB       | PRKCB1     |
| chr16 | 23754822 | 24139063 | PRKCB      | PRKCB1     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q15138     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q93060     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UE49     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UE50     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UEH8     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UJ30     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UJ33     |
| chr16 | 23754822 | 24139063 | PRKCB1     | uc002dmd.1 |
+-------+----------+----------+------------+------------+
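
Many of the values in the two alias columns are database identifiers rather than gene-name synonyms. One possible way to filter them out is sketched below. The regular expressions are rough approximations of RefSeq, UniProt, and UCSC identifier formats, not something supplied by UCSC, and `looks_like_id` is a made-up helper. They can also discard a genuine symbol that happens to have the same shape as an accession (P2RY12 is one such case), so the output still needs checking by eye.

```python
import re

# Approximate identifier shapes (assumptions, not official definitions).
ID_PATTERNS = [
    re.compile(r"^[NX][MRP]_\d+$"),                      # RefSeq, e.g. NM_002738
    re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)?$"),  # UniProt accession, optional isoform
    re.compile(r"^[A-Z0-9]+_HUMAN$"),                    # UniProt entry name
    re.compile(r"^uc\d{3}[a-z]{3}\.\d+$"),               # UCSC known gene ID
]


def looks_like_id(alias):
    """Return True if the alias matches any of the identifier patterns."""
    return any(p.match(alias) for p in ID_PATTERNS)


# A subset of the aliases from the result table above.
aliases = ["NM_002738", "NP_002729", "P05771-2", "PKCB", "PRKCB", "PRKCB1",
           "uc002dmc.1", "KPCB_HUMAN", "O43744", "Q9UJ33"]
names = [a for a in aliases if not looks_like_id(a)]
print(names)
```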
ADD COMMENT
2
Entering edit mode

Pierre: I am afraid what you have here is mostly ID mapping to UniProt (those starting with Q) and NCBI identifiers (NM). I am afraid they may not qualify as synonyms of gene names.

ADD REPLY
0
Entering edit mode

presumably this would work if you used the official gene symbol SELL?

ADD REPLY
0
Entering edit mode

Pierre: I am afraid what you have here is mostly ID mapping to UniProt (those starting with Q) and NCBI identifiers (NM). I am afraid they may not qualify as synonyms of gene names. @Andrew: could you clarify this?

ADD REPLY
0
Entering edit mode

yes, it does work with SELL

ADD REPLY
0
Entering edit mode

Just checked PRKCB at GeneALaCart; it retrieves

PRKCB2 |MGC41878 |PRKCB1 |PKCB |PKC-B |PKC-beta |EC 2.7.11.13

as Aliases. Curious to know if we can get such gene synonyms via UCSC DB.

ADD REPLY
0
Entering edit mode

@Khader, ah, ok :-)

ADD REPLY
