Question

What Is The Best Gene Identifier System To Use For A Bacterial Genome?

10

13.7 years ago

Michael Barton ★ 1.9k

I'm preparing to submit a draft bacterial genome to GenBank. The submission specifications require each gene have a unique identifier. What is a good identifier system to use? Start at 1 then use increasing numbers clockwise from the origin of replication?

This system seems fragile, though. For instance, how should genes discovered after submission be identified? What if you wish to reassemble and reannotate the genome in light of new sequencing data?

Are there any interesting alternative systems, e.g. using a digest of the gene sequence?

genome annotation identifiers gene • 3.3k views

score 5 · Answer 1 · 2011-06-10

I work a lot with bacterial genome data from many sources. What most people use is something along the lines of "MB12345.1" (assuming that you have sequenced Miraculobacillus bartonii): a two-letter species abbreviation, followed by an ORF number (ideally following the order in which the genes occur in the genome), followed by a version number (typically the version of the assembly). Some people, particularly those working on eukaryotes, leave room for newly identified genes by numbering in steps, e.g. MB123450, MB123460...

Generally, these identifiers are pleasant to work with. If I had to invent something new, I would probably put the assembly number next to the species abbreviation, maybe like MB1_12345. However, it is probably advisable to stick to the conventions.
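
To make the "prefix + ORF number + version" layout concrete, here is a minimal Python sketch that splits such an identifier into its parts. The regular expression and the names `GeneId` and `parse_gene_id` are illustrative assumptions, not an established convention or a GenBank requirement.

```python
import re
from typing import NamedTuple


class GeneId(NamedTuple):
    """Hypothetical container for the three parts described above."""
    prefix: str   # species abbreviation, e.g. "MB"
    orf: int      # ORF number, ideally in genome order
    version: int  # version number, e.g. of the assembly


# Assumed layout: letters, then digits, then "." and a version number.
_PATTERN = re.compile(r"^([A-Za-z]+)(\d+)\.(\d+)$")


def parse_gene_id(text):
    """Split an identifier such as 'MB12345.1' into prefix, ORF number and version."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognised identifier: {text!r}")
    prefix, orf, version = match.groups()
    return GeneId(prefix, int(orf), int(version))


print(parse_gene_id("MB12345.1"))
```

Storing the ORF number as an integer makes it easy to sort genes by genome order, at the cost of losing any zero padding, which would have to be restored when printing.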

score 3 · Answer 2 · 2011-06-09

In the end, the unique identifiers will only map to one specific version of the genome and its annotation. Through re-sequencing and re-annotation, sequences might change etc., and any additional info that you encode in the ids could be invalidated.

The ids need to be short, so that humans can quickly recognize and distinguish them (and many programs have a limit on gene name length). So I just cannot see the advantage of URNs / hashes / digests. You don't need any fancy namespaces / prefixes: it's clear what species this is.

score 2 · Answer 3 · 2011-06-09

Pierre Lindenbaum 165k

My two cents: I would use a short URN/URI as a unique identifier (see, for example, the now-deprecated LSID) that would include the version number of your assembly, of your contig, and of your annotation. You could use the position in the contig to identify the gene itself.

something like:

urn:barton2011:Ecoli22:1:108.1

a short hash would be another good idea, but people would need to resolve it.

urn:barton2011:1177914dfdc89a56e
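
A rough Python sketch of the hash idea, for illustration only: the choice of SHA-1, the 16-character truncation, and the function name `digest_id` are assumptions, and the `barton2011` namespace is simply copied from the example above.

```python
import hashlib


def digest_id(sequence, namespace="barton2011", length=16):
    """Build a URN-style identifier from a digest of the gene sequence.

    Whitespace is removed and the sequence is upper-cased first, so that
    formatting differences alone do not change the identifier.
    """
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalized.encode("ascii")).hexdigest()
    # Truncating keeps the id short but raises the chance of collisions.
    return f"urn:{namespace}:{digest[:length]}"


print(digest_id("ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"))
```

Note that a single corrected base gives a completely different identifier, so such ids track sequences rather than genes, and a lookup table is needed to get from the id back to the gene.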

hey another idea ! a tweet ID!!! geneid:78892400524791808 :-)

Comment by brentp: yes! use the tweet ID.

score 2 · Answer 4 · 2011-06-10

In our lab we used a 3-letter prefix derived from the name of our bacterium, followed by a 5-digit number (increasing clockwise from the origin of replication) incremented by 10. This leaves space in case you have to add new features to your GBK file.

Example:

A bug called "E. coli strain K" would produce:

ECK_00010

ECK_00020

ECK_00030

etc...
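
The numbering above, and the way the gaps can be used later, could be sketched in Python as follows. The function names `locus_tags` and `tag_between` are made up for this illustration, and picking the midpoint for a new gene is just one possible choice.

```python
def locus_tags(prefix, n_genes, step=10, width=5):
    """Return tags such as ECK_00010, ECK_00020, ... for genes in genome order."""
    return [f"{prefix}_{i * step:0{width}d}" for i in range(1, n_genes + 1)]


def tag_between(left, right):
    """Return an unused tag between two neighbouring tags, or None if no gap is left."""
    prefix, left_num = left.rsplit("_", 1)
    right_num = right.rsplit("_", 1)[1]
    middle = (int(left_num) + int(right_num)) // 2
    if middle == int(left_num):
        return None  # the two tags are adjacent numbers, nothing fits in between
    return f"{prefix}_{middle:0{len(left_num)}d}"


tags = locus_tags("ECK", 3)
print(tags)                            # ['ECK_00010', 'ECK_00020', 'ECK_00030']
print(tag_between(tags[0], tags[1]))   # ECK_00015
```

With a step of 10, only a handful of genes can be added between two existing neighbours before the gap is used up, so the step size is a trade-off between short numbers and room for later additions.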

score 2 · Answer 5 · 2011-06-12

Another thing to consider: we are increasingly realizing that each sequenced individual (or clonal population) contains differences. The most widely acknowledged are of course SNPs, but indels and copy number polymorphisms are also very common. In addition, there are many rearrangements (at least for organisms with linear chromosomes) and horizontal gene transfers from other individuals/populations...

All this tells me that at some point not too far away we will be considering the sequence of each individual within a species independently from the others. We will eventually need to be able to distinguish between strains.

Note that a strain (especially bacterial) cultivated in vitro for hundreds/thousands of generations is likely to have a different genome from that of the original strain. So even with the same name, it might not be the same.

To date, I haven't seen anyone take this into account in their sequencing, annotation, or naming efforts. But I do know that it is an issue in some fields of research, such as parasitology. At a conference, I saw several attendees arguing heatedly over conflicting results. In the end they agreed that although they were using the same "strain", it had been cultivated for a long time and each of their aliquots had certainly evolved very differently.

I would suggest including something in the name to specify which isolate you are annotating. For that matter, I think Pierre's suggestion may be a good one (URN or URI).