Regular Expression For Real World Gene Symbols?
11.9 years ago
Samuel Lampa ★ 1.3k

What would be a suitable regular expression for representing gene symbols of various species (Homo sapiens, Mus musculus, Rattus norvegicus, etc.)?

Based on the HUGO Gene Nomenclature Committee FAQ: "The "symbol" is a unique series of Latin letters (upper case in human), often with Arabic numerals, which should ideally be no longer than six characters in length."

That would result in a regex like [A-Za-z0-9]{1,6} (or [A-Za-z0-9]+), but in real-world data I have found gene names containing other characters as well, such as the dash ("-") *. Do you know of more such oddities that need to be taken care of?

  • (*) These seem to be mitochondrial genes. Their names are of the form "mt-[A-Za-z0-9]+".
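As a minimal sketch of the two patterns discussed above, the following Python snippet compares the letters-and-digits pattern with a variant that also accepts an optional "mt-" prefix. The names `NAIVE`, `WITH_MT` and `matches` are made up for illustration, and the example symbols are only test inputs, not a claim about what any nomenclature committee allows.

```python
import re

# Letters and digits only, as suggested by the FAQ wording quoted above.
NAIVE = re.compile(r"[A-Za-z0-9]+")

# Same pattern with an optional "mt-" prefix (assumption: the prefix is
# always lowercase and followed by at least one letter or digit).
WITH_MT = re.compile(r"(?:mt-)?[A-Za-z0-9]+")


def matches(pattern, symbol):
    """Return True if the whole symbol matches the pattern."""
    return pattern.fullmatch(symbol) is not None


if __name__ == "__main__":
    for symbol in ["TP53", "mt-Nd1", "HLA-DRB1"]:
        print(symbol, matches(NAIVE, symbol), matches(WITH_MT, symbol))
```

Note that a symbol with a dash in another position, such as "HLA-DRB1", is rejected by both patterns, which is exactly the kind of oddity the question is asking about.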

Probably the simplest approach is to download the gene info and gene synonyms tables from NCBI and design a regex that captures as much of that as you like.
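One way to approach this, sketched here in Python, is to take whatever list of symbols you have extracted from such a table and count which characters fall outside plain ASCII letters and digits. The function name `unusual_characters` is hypothetical, and loading the symbols from the downloaded file is left out because it depends on the table you choose.

```python
from collections import Counter


def unusual_characters(symbols):
    """Count, per character, how many symbols contain it.

    Only characters other than ASCII letters and digits are reported,
    since those are the ones a simple [A-Za-z0-9]+ regex would miss.
    """
    counts = Counter()
    for symbol in symbols:
        for ch in set(symbol):  # count each character once per symbol
            if not (ch.isascii() and ch.isalnum()):
                counts[ch] += 1
    return counts


if __name__ == "__main__":
    # Small illustrative input; a real run would use the full symbol column.
    print(unusual_characters(["TP53", "mt-Nd1", "HLA-DRB1", "C1orf112"]))
```

The resulting counts show which extra characters the regex would have to allow, and how common each one is.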


I second Sean here. There are no fixed conventions for naming gene symbols. It gets more complicated with the numerous designations: culture or cell strain, splicing variants, etc. My suggestion is to find all the possible permutations of your symbols and adjust your grep pattern to catch all the symbols you need. It's a lot of work, but you can test the patterns in a tester such as REGex TESTER.

11.9 years ago
hurfdurf ▴ 490

The full HGNC specs are here. Every source of gene names has different exception cases, and even the official HGNC names get cleaned up periodically, so a valid name can become invalid over time. If you are writing a validator for a known species, just do a case-insensitive lookup against an actual gene names table rather than using a regexp validator. I'd use EnsMART and/or UCSC's gene name tables as a starting set.
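A minimal sketch of such a case-insensitive lookup in Python might look like this. It assumes the symbols have already been read from a gene names table into a list; `build_index` and `lookup` are hypothetical helper names, and the three symbols are only sample data.

```python
def build_index(symbols):
    """Map the case-folded form of each symbol to its official spelling."""
    return {symbol.casefold(): symbol for symbol in symbols}


def lookup(index, query):
    """Return the official symbol for a query, or None if it is unknown."""
    return index.get(query.strip().casefold())


if __name__ == "__main__":
    # Sample data only; a real index would be built from a downloaded table.
    index = build_index(["TP53", "BRCA1", "C1orf112"])
    for query in ["tp53", "Brca1", "notagene"]:
        print(query, "->", lookup(index, query))
```

Because names are retired or renamed over time, the index should be rebuilt whenever the underlying table is refreshed.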


Given a file containing a mixed list of gene names and HGNC gene symbols, you can use the following command to keep only the HGNC gene symbols (not the gene names):

grep -E '^[A-Z0-9-]+$|^C[0-9XY]+orf[0-9]+$'
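For readers who prefer to do the same filtering in a script, here is a rough Python equivalent of the grep pattern above. The function name `keep_symbols` is made up for this sketch, and the input list is sample data only.

```python
import re

# Same two alternatives as the grep pattern: all-uppercase symbols with
# digits and dashes, or "C<chromosome>orf<number>" style symbols.
SYMBOL = re.compile(r"[A-Z0-9-]+|C[0-9XY]+orf[0-9]+")


def keep_symbols(lines):
    """Return only the lines that look like HGNC gene symbols."""
    kept = []
    for line in lines:
        candidate = line.strip()
        if SYMBOL.fullmatch(candidate):
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    sample = ["TP53", "tumor protein p53", "C1orf112", "HLA-DRB1", "CXorf38"]
    print(keep_symbols(sample))
```

Like the grep version, this is a heuristic: a gene name written entirely in capital letters without spaces would also pass the filter.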
