What would be a suitable regular expression for representing gene symbols of various species (Homo sapiens, Mus musculus, Rattus norvegicus, etc.)?
Based on the HUGO Gene Nomenclature Committee FAQ: "The "symbol" is a unique series of Latin letters (upper case in human), often with Arabic numerals, which should ideally be no longer than six characters in length."
That would result in a regex like [A-Za-z0-9]{1,6} (or [A-Za-z0-9]+), but looking at some real-world data, I have found gene names containing other characters as well, such as a dash ("-") *. Do you know of more such oddities that need to be taken care of?
* These seem to be mitochondrial genes. The names are of the form "mt-[A-Za-z0-9]".
Probably the simplest approach is to download the gene info and gene synonyms tables from NCBI and design a regex that captures as much of that as you like.
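A minimal sketch of that approach in Python, assuming the tab-separated layout of the NCBI gene_info file (Symbol in the third column, pipe-separated Synonyms in the fifth, "-" for an empty field). The function name unusual_characters and the sample rows are made up for illustration; the rows only mimic the column layout and are not real records. It counts how many names contain each character outside [A-Za-z0-9], which shows what the regex has to allow.

```python
import re
from collections import Counter

# Characters the naive regex already covers.
PLAIN = re.compile(r"[A-Za-z0-9]")

def unusual_characters(lines):
    """Count, per character, how many symbols/synonyms contain it
    when that character is outside [A-Za-z0-9]."""
    counts = Counter()
    for line in lines:
        # Skip the header line (starts with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        symbol, synonyms = fields[2], fields[4]
        names = [symbol]
        if synonyms != "-":  # "-" marks a missing value in this layout
            names.extend(synonyms.split("|"))
        for name in names:
            for ch in set(name):  # count each character once per name
                if not PLAIN.fullmatch(ch):
                    counts[ch] += 1
    return counts

# Toy rows in the assumed layout (illustrative only, not real records).
sample = [
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms",
    "9606\t1\tTP53\t-\tP53|LFS1",
    "9606\t2\tHLA-A\t-\t-",
    "10090\t3\tmt-Nd1\t-\tND1",
    "10090\t4\tH2-K1\t-\tH-2K|K-f",
]
result = unusual_characters(sample)
print(result.most_common())
```

On a real download you would pass the open file handle instead of the sample list; the characters with the highest counts are the ones worth adding to the character class first.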
I second Sean here. There are no static conventions for naming gene symbols. It gets more complicated with numerous designations: culture or cell strain, splicing variants, etc. My suggestion is to find all the possible permutations of your symbols and tune your grep pattern to catch all the symbols you need. It's a lot of work, but you can test the patterns out on a tester such as REGex TESTER.
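If you would rather test patterns in a script than in an online tester, something along these lines may help. It is only a sketch: the pattern names, the "with_dash" pattern, and the short symbol list are assumptions chosen for illustration, not an agreed convention. It reports which symbols each candidate pattern fails to match in full.

```python
import re

# Candidate patterns: the first is the one from the question,
# the second is a hypothetical extension that allows dash-separated parts.
CANDIDATES = {
    "strict": r"[A-Za-z0-9]{1,6}",
    "with_dash": r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*",
}

def unmatched(pattern, symbols):
    """Return the symbols that the pattern does not match in full."""
    compiled = re.compile(pattern)
    return [s for s in symbols if not compiled.fullmatch(s)]

# A small hand-picked list; in practice, use the symbols from your own data.
symbols = ["TP53", "BRCA1", "mt-Nd1", "HLA-DRB1", "Tp53"]

report = {name: unmatched(p, symbols) for name, p in CANDIDATES.items()}
for name, misses in report.items():
    print(name, "misses:", misses)
```

Using fullmatch rather than search matters here: a partial match such as "mt" inside "mt-Nd1" would otherwise hide the symbols the pattern cannot represent.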