Question

A List Of Regular-Expressions For Recognizing Ids From Various Databases

12

13.5 years ago

Will 4.6k

I'm looking to create a list of regular expressions that can distinguish between the IDs of various databases. I know that some will be ambiguous, but at least it could help narrow down which databases to check.

For example:

KEGG IDs: \w{0,3}\d{1,}

Entrez IDs: \d+

RefSeq IDs: \w{2}_\d{1,}\.\d{1,}

Anyone have any to add? This might be a useful community resource.
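
As a rough illustration of how such a list could be used, here is a minimal Python sketch that tests an identifier against the three patterns above and reports every database whose pattern matches the whole string. The names `PATTERNS` and `candidate_databases` are made up for this example, and the Entrez pattern is written as `\d+` here so that the empty string is not accepted.

```python
import re

# Patterns from the question; names are illustrative only.
# \w{0,3} is the portable spelling of \w{,3}.
PATTERNS = {
    "KEGG": re.compile(r"\w{0,3}\d{1,}"),
    "Entrez": re.compile(r"\d+"),  # assumption: at least one digit required
    "RefSeq": re.compile(r"\w{2}_\d{1,}\.\d{1,}"),
}


def candidate_databases(identifier):
    """Return the names of all databases whose pattern matches the full ID."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.fullmatch(identifier)]


print(candidate_databases("NM_000546.5"))  # ['RefSeq']
print(candidate_databases("7157"))         # ['KEGG', 'Entrez'] (ambiguous)
```

A purely numeric ID matches both the KEGG and Entrez patterns, which is the kind of ambiguity mentioned above: the result narrows the search rather than settling it.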

database • 6.4k views

ADD COMMENT • link updated 8.3 years ago by Dchoy ▴ 40 • written 13.5 years ago by Will 4.6k

1

very useful topic

ADD REPLY • link 13.5 years ago by Casey Bergman 18k

0

changed to community wiki.

ADD REPLY • link 13.5 years ago by Pierre Lindenbaum 166k

0

Yeah, I've been trying to convert a 'mixed bag' of IDs and I had trouble even placing some of them. Hopefully this will help out.

ADD REPLY • link 13.5 years ago by Will 4.6k

score 9 · Answer 1 · 2011-11-02

9

13.5 years ago

Pablacious ▴ 630

Look at the MIRIAM registry:

http://www.ebi.ac.uk/miriam/main/collections/

They have assembled a large collection of regular expressions for the identifiers/accessions of a number of databases.

ADD COMMENT • link 13.5 years ago by Pablacious ▴ 630

2

very cool. Thanks !

ADD REPLY • link 13.5 years ago by Pierre Lindenbaum 166k

0

That is awesome. I never saw that before.

ADD REPLY • link 13.5 years ago by Will 4.6k

0

Yeah, it's cool; we use it for pretty much what you asked about here.

ADD REPLY • link 13.5 years ago by Pablacious ▴ 630

score 2 · Answer 2 · 2011-11-01

dbSNP: rs[0-9]+
Gene Ontology: GO:[0-9]+
DOI (from the connotea bookmarklet): (doi:)?\s?(10\.\d{4}/\S+)

LSID : from http://goo.gl/D6PT1

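 // Characters allowed in each LSID component (authority, namespace, object, optional revision)
 String legalId = "[A-Za-z0-9][A-Za-z0-9()+,-.=@;$_!*\'\"%]*";
 // Case-insensitive "urn:lsid:" prefix followed by three required components and one optional one
 String lsidRE = "^[uU][rR][nN]:[lL][sS][iI][dD]:(" + legalId + "):(" + legalId + "):(" + legalId + ")[:]?(" + legalId + ")?$";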
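
For anyone who wants to try the dbSNP, Gene Ontology and DOI patterns from this answer, here is a small Python sketch. The DOI pattern is written with its backslashes in place (`\s`, `\.`, `\d`, `\S`), which is an assumption about what the bookmarklet originally contained; the constant and function names are invented for the example.

```python
import re

DBSNP = re.compile(r"rs[0-9]+")
GENE_ONTOLOGY = re.compile(r"GO:[0-9]+")
# Assumption: backslashes restored in the DOI pattern.
DOI = re.compile(r"(doi:)?\s?(10\.\d{4}/\S+)")


def extract_doi(text):
    """Return the first DOI found in text (without any 'doi:' prefix), or None."""
    match = DOI.search(text)
    return match.group(2) if match else None


print(bool(DBSNP.fullmatch("rs334")))               # True
print(bool(GENE_ONTOLOGY.fullmatch("GO:0008150")))  # True
print(extract_doi("see doi:10.1093/nar/gkr1014 for details"))
# 10.1093/nar/gkr1014
```

`fullmatch` is used for the ID checks so that a longer string merely containing `rs123` is not accepted, while `search` is used for the DOI because it is usually embedded in surrounding text.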

Ram · Answer 3 · 2011-11-01

1

13.5 years ago

Casey Bergman 18k

For INSDC accession numbers, we have used:

(([A-Z]{1}[0-9]{5})|([A-Z]{2}[0-9]{6})|([A-Z]{4}[0-9]{8,9})|([A-Z]{5}[0-9]{7}))(\.[0-9]{1,3})

(Credit to Guy Cochrane at EBI)
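
A quick way to sanity-check the expression is sketched below in Python. It assumes the four accession alternatives are meant to be wrapped in one outer group so that the version suffix applies to all of them; `INSDC` and `is_insdc_accession` are names chosen for the example.

```python
import re

# Assumption: one outer group around the four alternatives, then ".version".
INSDC = re.compile(
    r"(([A-Z]{1}[0-9]{5})|([A-Z]{2}[0-9]{6})"
    r"|([A-Z]{4}[0-9]{8,9})|([A-Z]{5}[0-9]{7}))"
    r"(\.[0-9]{1,3})"
)


def is_insdc_accession(identifier):
    """True if the whole string is an accession.version of one of the four shapes."""
    return INSDC.fullmatch(identifier) is not None


print(is_insdc_accession("U12345.1"))    # True  (1 letter + 5 digits)
print(is_insdc_accession("AB123456.1"))  # True  (2 letters + 6 digits)
print(is_insdc_accession("AB123456"))    # False (no version suffix)
```

As written, the version suffix is required; appending `?` to the last group would also accept bare accessions.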

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 13.5 years ago by Casey Bergman 18k

Ram · Answer 4 · 2011-11-02

1

13.5 years ago

Pierre Poulain ▴ 440

Protein Data Bank (PDB):

[0-9][A-Z0-9]{3}

UniProt:

[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]

and

[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]
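
The two UniProt patterns can be joined with `|` and checked alongside the PDB pattern, as in this minimal Python sketch (the names `PDB`, `UNIPROT` and `classify` are illustrative, not from any library):

```python
import re

PDB = re.compile(r"[0-9][A-Z0-9]{3}")
# The two UniProt patterns above, combined into one alternation.
UNIPROT = re.compile(
    r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]"
    r"|[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]"
)


def classify(identifier):
    """Return 'UniProt', 'PDB', or None for an upper-case identifier."""
    if UNIPROT.fullmatch(identifier):
        return "UniProt"
    if PDB.fullmatch(identifier):
        return "PDB"
    return None


print(classify("P04637"))  # UniProt
print(classify("A2BC19"))  # UniProt
print(classify("1TUP"))    # PDB
```

Both patterns expect upper-case input, so lower-case PDB codes such as `1tup` would need `identifier.upper()` or the `re.IGNORECASE` flag first.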

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 13.5 years ago by Pierre Poulain ▴ 440

score 1 · Answer 5 · 2017-01-03

For gene annotations in KEGG databases such as

glutamate synthase Glt1, putative; K00264 glutamate synthase (NADPH/NADH) [EC:1.4.1.13 1.4.1.14]

To extract the KEGG orthology (KO) number:

(^| |\(|\[)(K[0-9]{5})($| |\)|\])

Will work with:

"K00264 glutamate synthase ... " id at start
"...putative; K00264 glutamate..." id in middle
"Glt1, putative; K00264" id at end
"(K23102 K23010)" round brackets
"[K23102 K23010]" square brackets

Will exclude:

" K2041020 " back-extensions
" AK29310 " front-extensions

_

(^| |\(|\[)

matches the start of the string, a preceding space, or an opening round/square bracket

(K[0-9]{5})

captures the KEGG ID, e.g. K10230, K20310. This can be replaced with the MetaCyc ID format, etc.

($| |\)|\])

matches the end of the string, a following space, or a closing round/square bracket
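
For use from Python, a lookaround-based variant of the same idea is sketched below. It assumes the leading group is meant to accept an opening bracket and the trailing group a closing one, and it checks those delimiters without consuming them, so that two IDs separated by a single space do not compete for that space when scanning with `re.findall`. `KO_PATTERN` and `extract_ko` are names made up for this example.

```python
import re

# Preceded by start-of-string, space, "(" or "[";
# followed by end-of-string, space, ")" or "]".
KO_PATTERN = re.compile(r"(?:^|(?<=[ (\[]))K[0-9]{5}(?=$|[ )\]])")


def extract_ko(text):
    """Return all KO numbers found in an annotation string."""
    return KO_PATTERN.findall(text)


line = ("glutamate synthase Glt1, putative; K00264 glutamate synthase "
        "(NADPH/NADH) [EC:1.4.1.13 1.4.1.14]")
print(extract_ko(line))               # ['K00264']
print(extract_ko("(K23102 K23010)"))  # ['K23102', 'K23010']
print(extract_ko(" K2041020 "))       # []
print(extract_ko(" AK29310 "))        # []
```

With the consuming groups, the space between `K23102` and `K23010` is used up by the first match, so a `findall` scan would report only the first ID; the lookarounds avoid that.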