A List Of Regular-Expressions For Recognizing Ids From Various Databases
13.2 years ago
Will 4.6k

I'm looking to create a list of regular expressions that can distinguish between the IDs of various databases. I know that some will be ambiguous, but at least it could help narrow down which databases to check.

For example:

KEGG IDs: \w{,3}\d{1,}

Entrez IDs: \d+

RefSeq IDs: \w{2}_\d{1,}\.\d{1,}

Anyone have any to add? This might be a useful community resource.
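
Since some formats overlap, one way to use such a list is to test an ID against every pattern and keep all the databases that match. Here is a minimal Python sketch, assuming the three example patterns above. They are applied with `fullmatch()` so the whole ID must match, the quantifiers are written as `{0,3}` and `+` for portability, and Entrez uses `\d+` so that an empty string is rejected. The names `ID_PATTERNS` and `candidate_databases` are hypothetical.

```python
import re

# Patterns adapted from the examples in the question (an assumption, not a
# curated list); fullmatch() requires the entire identifier to match.
ID_PATTERNS = {
    "KEGG": re.compile(r"\w{0,3}\d+"),
    "Entrez": re.compile(r"\d+"),
    "RefSeq": re.compile(r"\w{2}_\d+\.\d+"),
}


def candidate_databases(identifier):
    """Return the names of all databases whose pattern matches the whole ID."""
    return [
        name
        for name, pattern in ID_PATTERNS.items()
        if pattern.fullmatch(identifier)
    ]


print(candidate_databases("NM_000546.5"))  # RefSeq only
print(candidate_databases("7157"))         # ambiguous: KEGG and Entrez
```

A purely numeric ID matches both the KEGG and Entrez patterns, which is the kind of ambiguity mentioned above: the result narrows down the candidates rather than identifying a single database.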


very useful topic


changed to community wiki.


Yeah, I've been trying to convert a 'mixed bag' of IDs and I had trouble even placing some of them. Hopefully this will help out.

13.2 years ago
Pablacious ▴ 630

Look at the MIRIAM registry:

http://www.ebi.ac.uk/miriam/main/collections/

They have assembled a large collection of regular expressions for the identifiers/accessions of a number of databases.


Very cool. Thanks!


That is awesome. I never saw that before.


Yeah, it's cool; we use it for pretty much what you asked about here.

13.2 years ago
  • dbSNP: rs[0-9]+
  • Gene Ontology: GO:[0-9]+
  • DOI (from the connotea bookmarklet): (doi:)?\s?(10\.\d{4}/\S+)
  • LSID : from http://goo.gl/D6PT1

     String legalId =  "[A-Za-z0-9][A-Za-z0-9()+,-.=@;$_!*\'\"%]*";
     String lsidRE = "^[uU][rR][nN]:[lL][sS][iI][dD]:(" + legalId + "):(" + legalId + "):(" + legalId + ")[:]?(" + legalId + ")?$";
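
For comparison, the four patterns above could be collected in Python roughly as follows. This is a sketch rather than a reference implementation. The DOI pattern is written with its backslash escapes (`\s`, `\.`, `\d`, `\S`), which is an assumption about the original bookmarklet. The LSID pattern uses `re.IGNORECASE` in place of the `[uU][rR][nN]` spelling. The names `LEGAL_ID`, `PATTERNS` and `matching_databases` are hypothetical.

```python
import re

# One LSID component: same character set as the Java legalId string.
LEGAL_ID = r'''[A-Za-z0-9][A-Za-z0-9()+,\-.=@;$_!*'"%]*'''

PATTERNS = {
    "dbSNP": re.compile(r"rs[0-9]+"),
    "Gene Ontology": re.compile(r"GO:[0-9]+"),
    "DOI": re.compile(r"(doi:)?\s?(10\.\d{4}/\S+)"),
    "LSID": re.compile(
        rf"urn:lsid:({LEGAL_ID}):({LEGAL_ID}):({LEGAL_ID}):?({LEGAL_ID})?",
        re.IGNORECASE,
    ),
}


def matching_databases(text):
    """Return the names of every pattern that matches the whole string."""
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(text)]


for example in ("rs429358", "GO:0008150", "doi:10.1093/nar/gkr1097",
                "urn:lsid:ubio.org:namebank:11815"):
    print(example, "->", matching_databases(example))
```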
    
13.2 years ago

For INSDC accession numbers, we have used:

(([A-Z]{1}[0-9]{5})|([A-Z]{2}[0-9]{6})|([A-Z]{4}[0-9]{8,9})|([A-Z]{5}[0-9]{7}))(\.[0-9]{1,3})

(Credit to Guy Cochrane at EBI)
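
A small Python sketch of how this pattern might be applied. It is written with plain ASCII hyphens in the character ranges, which regex engines require, and with the accession alternatives wrapped in one group so the version suffix can be split off. As in the pattern above, the version suffix is mandatory. The function names are hypothetical.

```python
import re

INSDC_RE = re.compile(
    r"(([A-Z]{1}[0-9]{5})"      # 1 letter + 5 digits
    r"|([A-Z]{2}[0-9]{6})"      # 2 letters + 6 digits
    r"|([A-Z]{4}[0-9]{8,9})"    # 4 letters + 8 or 9 digits
    r"|([A-Z]{5}[0-9]{7}))"     # 5 letters + 7 digits
    r"(\.[0-9]{1,3})"           # version suffix, e.g. ".1"
)


def is_insdc_accession(text):
    """True if the whole string looks like a versioned INSDC accession."""
    return INSDC_RE.fullmatch(text) is not None


def split_accession(text):
    """Return (accession, version) or None if the string does not match."""
    match = INSDC_RE.fullmatch(text)
    if match is None:
        return None
    # Group 1 is the accession; the last group is the ".N" suffix.
    return match.group(1), match.group(6)[1:]


print(is_insdc_accession("U12345.1"))
print(split_accession("AB123456.2"))
print(is_insdc_accession("AB123456"))  # no version suffix, so no match
```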

13.2 years ago

Protein Data Bank (PDB):

[0-9][A-Z0-9]{3}

UniProt:

[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]

and

[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]
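
A quick Python sketch of these patterns, with the two UniProt alternatives combined into one expression. The function name `classify` is hypothetical. Note that the PDB pattern also accepts any four-digit number, so it overlaps with purely numeric IDs, and the two UniProt patterns cover only six-character accessions.

```python
import re

PDB_RE = re.compile(r"[0-9][A-Z0-9]{3}")
UNIPROT_RE = re.compile(
    r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]"
    r"|[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]"
)


def classify(identifier):
    """Return which of the two patterns match the whole identifier."""
    hits = []
    if PDB_RE.fullmatch(identifier):
        hits.append("PDB")
    if UNIPROT_RE.fullmatch(identifier):
        hits.append("UniProt")
    return hits


for example in ("4HHB", "P69905", "A2BC19", "1234"):
    print(example, "->", classify(example))
```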
8.0 years ago
Dchoy ▴ 40

For gene annotations in KEGG databases such as

glutamate synthase Glt1, putative; K00264 glutamate synthase (NADPH/NADH) [EC:1.4.1.13 1.4.1.14]

To extract the KEGG Orthology (KO) number:

(^|[ (\[])(K[0-9]{5})($|[ )\]])

Will work with:

  1. "K00264 glutamate synthase ... " ID at start
  2. "...putative; K00264 glutamate..." ID in the middle
  3. "Glt1, putative; K00264" ID at end
  4. "(K23102 K23010)" round brackets
  5. "[K23102 K23010]" square brackets

Will exclude:

  1. " K2041020 " back-extensions
  2. " AK29310 " front-extensions

(^|[ (\[])

matches the start of the string, a preceding space, or an opening round/square bracket

(K[0-9]{5})

captures the KEGG ID, e.g. K10230, K20310. This can be replaced with the MetaCyc ID format, etc.

($|[ )\]])

matches the end of the string, a following space, or a closing round/square bracket
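
One caveat when pulling several IDs out of the same string: the boundary groups consume the delimiter, so two IDs separated by a single space cannot both be found in one scan. A possible variant, sketched in Python, checks the same delimiters with lookarounds, which do not consume them. This is an adaptation rather than the original pattern, and the function name `extract_ko_ids` is hypothetical.

```python
import re

# Same boundaries as above, expressed as lookarounds so the delimiter
# between two adjacent IDs can serve both of them.
KO_RE = re.compile(r"(?:^|(?<=[ (\[]))(K[0-9]{5})(?=$|[ )\]])")


def extract_ko_ids(annotation):
    """Return every KO number found in a KEGG annotation string."""
    return KO_RE.findall(annotation)


annotation = (
    "glutamate synthase Glt1, putative; K00264 glutamate synthase "
    "(NADPH/NADH) [EC:1.4.1.13 1.4.1.14]"
)
print(extract_ko_ids(annotation))
print(extract_ko_ids("(K23102 K23010)"))
print(extract_ko_ids(" K2041020 "), extract_ko_ids(" AK29310 "))
```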
