Question

Finding Gene Symbol Synonyms

20

14.9 years ago

Mike Dewar ★ 1.6k

Some HGNC gene symbols have synonyms that are more familiar to biologists of particular breeds. For example, "SELL" means little to an immunologist, whereas SELL's alias "CD62L" means rather a lot. Showing the biologist a list of gene names and asking "do any of these ring bells?" seems to be an important part of the process of selecting important genes (or, rather, validating the selection), and hence I'd like to make sure they see the gene names that make sense to them.

My question, therefore, is: does anyone know a simple method to retrieve gene synonyms? I don't want to do any enrichment, or clustering or normalisation, I just need a mapping from HGNC symbol -> synonyms. I can't quite figure out how to persuade biomart to do this.

In addition, are there certain sets of symbols that are preferred by some communities? For example, do I need to search through all the synonymous symbols, or can I just ask biomart (or something) to return a particular set of gene symbols?

annotation • 26k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 14.9 years ago by Mike Dewar ★ 1.6k

Ram · Answer 1 · 2010-06-08

20

14.9 years ago

Andrew Su 4.9k

If you want to do this analysis for a large number of gene symbols and/or from the command line, I would first download gene_info.gz from the NCBI Gene FTP site (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz), and then use awk to parse it. For example, SELL has the Entrez Gene ID 6402, so:

gzip -cd gene_info.gz | awk '$2==6402{print $5}'

produces this output:

CD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1

(The second column of gene_info is Entrez Gene ID, the fifth column has the aliases)

You can also do a similar awk parsing based on the gene symbol directly, but then you probably also want to limit it by organism (e.g., human=9606). For example:

gzip -cd gene_info.gz | awk '$3=="SELL"&&$1==9606{print $5}'

produces the same output as above...

To get a file that translates all human gene symbols to their aliases:

gzip -cd gene_info.gz | awk '$1==9606{print $3"\t"$5}' > output.txt
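
If you also need to go the other way, from a familiar alias back to the official symbol, it can help to unpack the pipe-separated alias column into one symbol/alias pair per line. The following is only a minimal sketch: the function names symbol_alias_pairs and alias_to_symbol are made up for illustration, and it assumes the column layout described above plus the "-" placeholder that gene_info uses for empty fields.

```shell
# Print one "symbol<TAB>alias" pair per line for one organism.
# Assumed layout (tab-separated gene_info): column 1 = taxonomy ID,
# column 3 = symbol, column 5 = "|"-separated aliases, "-" = no aliases.
symbol_alias_pairs() {
    # $1 = path to a gzipped gene_info file, $2 = taxonomy ID (default: 9606, human)
    gzip -cd "$1" | awk -F '\t' -v tax="${2:-9606}" '
        $1 == tax && $5 != "-" {
            # "[|]" is a bracket expression, so the pipe is matched literally
            n = split($5, aliases, "[|]")
            for (i = 1; i <= n; i++) print $3 "\t" aliases[i]
        }'
}

# Print every symbol that lists the given alias (hypothetical helper).
alias_to_symbol() {
    # $1 = alias to look up, $2 = gzipped gene_info file, $3 = optional taxonomy ID
    symbol_alias_pairs "$2" "${3:-9606}" |
        awk -F '\t' -v query="$1" '$2 == query { print $1 }'
}
```

Calling alias_to_symbol CD62L gene_info.gz prints every human symbol that lists CD62L as an alias. An alias can belong to more than one gene, so expect several lines for some queries.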

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.9 years ago by Andrew Su 4.9k

4

Four years since you've posted this, I've just found it. Exactly what I was looking for, thanks. Similarly to the initial poster, I am just interested in H. sapiens genes. This means you don't need to download the rather large full list from Entrez but can limit yourself to:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by Stu@IC ▴ 40

0

Very useful, thanks!

One clarification: if you want the official symbol ($11), set awk's -F option to "\t". Some columns, such as the description, contain spaces, so the default whitespace splitting shifts the later fields. For example:

gzip -cd gene_info.gz | awk -F "\t" '$1==9606&&$3=="SELL"{print $3"\t"$5"\t"$11}'

## $3 = Symbol
## $11 = Symbol_from_nomenclature_authority

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 10.5 years ago by LGMgeo ▴ 110

0

I have an R wrapper for this at https://github.com/oganm/geneSynonym. It extracts synonym information about the species of interest and allows you to query any gene symbol for synonyms.

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.5 years ago by oganm ▴ 60

Ram · Answer 2 · 2010-06-08

7

14.9 years ago

Khader Shameer 18k

GeneALaCart from GeneCards would be a good start. There are definitely other resources that can do this type of mapping, but in my ID-mapping experience GeneCards provides a good number of aliases and descriptions for human genes.

A quick search using GeneALaCart got the following aliases for CD62L

Copied from the output CSV file :

Gene Symbol : SELL 
Entrez_Gene ID : 5579    
HGNC_ID : 9395
Aliases : LEU8 |LAM1 |LECAM1 |hLHRc |Leu-8 |TQ1 |LAM-1 |LSEL |PLNHR |LNHR |CD62L |gp90-MEL |LYAM1 |Lyam-1

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.9 years ago by Khader Shameer 18k

0

It is very strange that the Entrez_Gene ID of your instance is 5579 because it is the Entrez_Gene ID of PRKCB protein kinase C. It should be 6402.

ADD REPLY • link 14.9 years ago by Fred Fleche 4.3k

Ram · Answer 3 · 2010-06-09

6

14.9 years ago

Fred Fleche 4.3k

In your case, maybe the easiest way would be to use the HGNC output data webpage.

You can easily check the fields of your choice like:

Approved Symbol
Aliases
Entrez Gene ID

Then also check

Select Status Approved
Select all Chromosomes

Then press submit to get the listing as a text file that you can either use in Excel or load into an SQL database.

In the case of the SELL gene reported previously, you get:

SELL#LSEL, LAM1, LAM-1, hLHRc, Leu-8, Lyam-1, PLNHR, CD62L#6402
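
Once the listing is saved, the comma-separated aliases can be unpacked into one symbol/alias pair per line for easier lookups. This is a rough sketch under stated assumptions, not a description of the HGNC export itself: it assumes the saved file is tab-separated with one header row and the three fields in the order selected above (symbol, aliases, Entrez Gene ID), and hgnc_alias_pairs is a made-up name. Adjust the separator and column numbers to match your actual download.

```shell
# Turn an HGNC-style listing into "symbol<TAB>alias" pairs.
# Assumed layout (hypothetical): tab-separated, one header row,
# column 1 = approved symbol, column 2 = aliases separated by ", ".
hgnc_alias_pairs() {
    # $1 = path to the downloaded text file
    awk -F '\t' 'NR > 1 && $2 != "" {
        # split on a comma followed by optional spaces
        n = split($2, aliases, /, */)
        for (i = 1; i <= n; i++) print $1 "\t" aliases[i]
    }' "$1"
}
```

Genes with an empty alias field are skipped, so the output only contains symbols that have at least one alias.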

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.9 years ago by Fred Fleche 4.3k

1

Fred - IMHO stands for "in my honest opinion"! I think you have a fan, rather than a competitor...

ADD REPLY • link 14.9 years ago by Mike Dewar ★ 1.6k

1

I find French online acronyms to be very difficult also, though typically more fun! I went with the awk-based answer above as it involves less clicking, though I think your answer will be very helpful to others coming across this question...

ADD REPLY • link 14.9 years ago by Mike Dewar ★ 1.6k

1

IMHO, this has been my fault. In the future I'll try to be more precise and academic ;)

ADD REPLY • link 14.9 years ago by Jorge Amigo 14k

0

IMHO this is, by far, the easiest way of retrieving such data

ADD REPLY • link 14.9 years ago by Jorge Amigo 14k

0

@Jorge. You are very welcome to click the "Add Another Answer" button and demonstrate how to get the listing through IMHO. I think everybody here is eager to learn new methods, so do not hesitate to share yours.

ADD REPLY • link 14.9 years ago by Fred Fleche 4.3k

0

@Jorge. Actually, I didn't know this English acronym. I thought it was a bioinformatics server. And don't worry, I will never consider you or others as competitors. I am here to learn new solutions. I am glad you like my solution. Feel free to click the "Click to set this answer as your accepted answer" button ;-)

ADD REPLY • link 14.9 years ago by Fred Fleche 4.3k

Ram · Answer 4 · 2010-06-08

4

14.9 years ago

Pierre Lindenbaum 166k

Unfortunately, CD62L is not present in the UCSC DB. However, here is a query for another gene (PRKCB1) listing the position and the aliases.

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg18 -e '
select distinct
 K.chrom,
 K.txStart,
 K.txEnd,
 A1.alias,
 A2.alias
from
 knownGene as K,
 kgAlias as A1,
 kgAlias as A2
where
 K.name=A1.kgID and
 K.name=A2.kgID and
 A1.alias<A2.alias and
 (A1.alias="PRKCB1" or A2.alias="PRKCB1") '

result:

+-------+----------+----------+------------+------------+
| chrom | txStart  | txEnd    | alias      | alias      |
+-------+----------+----------+------------+------------+
| chr16 | 23754822 | 24134810 | NM_002738  | PRKCB1     |
| chr16 | 23754822 | 24134810 | NP_002729  | PRKCB1     |
| chr16 | 23754822 | 24134810 | P05771-2   | PRKCB1     |
| chr16 | 23754822 | 24134810 | PKCB       | PRKCB1     |
| chr16 | 23754822 | 24134810 | PRKCB      | PRKCB1     |
| chr16 | 23754822 | 24134810 | PRKCB1     | uc002dmc.1 |
| chr16 | 23754822 | 24139063 | KPCB_HUMAN | PRKCB1     |
| chr16 | 23754822 | 24139063 | NM_212535  | PRKCB1     |
| chr16 | 23754822 | 24139063 | NP_997700  | PRKCB1     |
| chr16 | 23754822 | 24139063 | O43744     | PRKCB1     |
| chr16 | 23754822 | 24139063 | P05127     | PRKCB1     |
| chr16 | 23754822 | 24139063 | P05771     | PRKCB1     |
| chr16 | 23754822 | 24139063 | PKCB       | PRKCB1     |
| chr16 | 23754822 | 24139063 | PRKCB      | PRKCB1     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q15138     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q93060     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UE49     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UE50     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UEH8     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UJ30     |
| chr16 | 23754822 | 24139063 | PRKCB1     | Q9UJ33     |
| chr16 | 23754822 | 24139063 | PRKCB1     | uc002dmd.1 |
+-------+----------+----------+------------+------------+
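
Many of the values in this table are database identifiers (RefSeq NM_/NP_ accessions, UniProt accessions and entry names, UCSC uc… IDs) rather than gene symbols. A rough way to screen them out is a pattern filter over the alias column. The sketch below is heuristic only: drop_id_like_aliases is a made-up name, the patterns are my own guesses at the identifier shapes seen above, and they will also discard real symbols that happen to look like an accession (P2RY12, for instance, has the same shape as a UniProt accession).

```shell
# Read aliases on stdin (one per line) and drop those that look like
# database identifiers rather than gene symbols. Heuristic patterns only.
drop_id_like_aliases() {
    # -e 1: RefSeq accessions such as NM_002738 or NP_002729
    # -e 2: UCSC known gene IDs such as uc002dmc.1
    # -e 3: UniProt entry names such as KPCB_HUMAN
    # -e 4: UniProt O/P/Q accessions such as P05771, optionally with an isoform suffix
    LC_ALL=C grep -Ev \
        -e '^[NX][MPR]_[0-9]+' \
        -e '^uc[0-9]{3}[a-z]{3}\.[0-9]+$' \
        -e '_HUMAN$' \
        -e '^[OPQ][0-9][A-Z0-9]{3}[0-9](-[0-9]+)?$'
}
```

To use it with the query above, add the -N -B options to mysql to get plain tab-separated rows, extract the alias columns with cut, and pipe them through the function.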

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.9 years ago by Pierre Lindenbaum 166k

2

Pierre: I am afraid what you have here is mostly ID mapping to UniProt (those starting with Q) and NCBI identifiers (NM). I am afraid they may not qualify as synonyms of gene names.

ADD REPLY • link 14.9 years ago by Khader Shameer 18k

0

presumably this would work if you used the official gene symbol SELL?

ADD REPLY • link 14.9 years ago by Andrew Su 4.9k

0

Pierre: I am afraid what you have here is ID mapping to UniProt (those starting with Q) and NCBI identifiers (NM). I am afraid they may not qualify as synonyms of gene names.

@ Andrew : Could you clarify this.

ADD REPLY • link 14.9 years ago by Khader Shameer 18k

0

yes, it does work with SELL

ADD REPLY • link 14.9 years ago by Pierre Lindenbaum 166k

0

I just checked PRKCB at GeneALaCart; it retrieves

PRKCB2 |MGC41878 |PRKCB1 |PKCB |PKC-B |PKC-beta |EC 2.7.11.13

as Aliases. Curious to know if we can get such gene synonyms via UCSC DB.

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.9 years ago by Khader Shameer 18k

0

@Khader, ah, ok :-)

ADD REPLY • link 14.9 years ago by Pierre Lindenbaum 166k