Question

How to get bacterial gene names if I know their coordinates

2

10.7 years ago

bioslayer ▴ 50

My ultimate goal is to do a comparative GO enrichment analysis on a collection of bacterial genes from over 3000 genomes. Within each genome I have a list of around 100 interesting genes with their start/stop coordinates. I have the names and PRODUCT annotations for some of the genes, but my parsing of the EMBL and GFF files did not return names for the rest.

Now, to do a GO term analysis I need to get the gene IDs. I was hoping that GO searches could be sequence-based, so that I would not have to worry about getting the list of names for every gene of interest. The closest I got was Blast2GO, which is extremely slow and therefore not feasible at this scale. Another option I thought about was to programmatically get the gene names by submitting the coordinates to Ensembl/BioMart through their API, but neither resource seems to support bacterial genomes.

I am referring this to the Biostars community because I could not find answers on my own. My two questions are: what is the best strategy to get the corresponding gene names if I already have the bacterial EMBL genome ID and the start/stop coordinates for each gene? And is it even possible to do a sequence-based GO analysis?

GO-enrichment gene-names • 3.6k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by bioslayer ▴ 50

0

Have you tried the Ensembl Genomes REST API for bacterial genomes?

ADD REPLY • link 10.7 years ago by Denise CS ★ 5.2k

0

Yes, REST was among the options I considered. I saw its Perl client and went through the documentation, but I cannot really tell how to pass it a genome ID and coordinates to get back the gene names. It would be great if you could provide an example query to illustrate this.

Thanks

ADD REPLY • link 10.7 years ago by bioslayer ▴ 50

2

Sure! Which EMBL accession have you got? If you have, for example, U00096 (E. coli K-12) and know the name we (Ensembl Genomes) use for that species, you can use this endpoint. If you don't know the species name, you can get that info from this endpoint.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Denise CS ★ 5.2k

1

This simplifies it very nicely; I can easily write a wrapper for these endpoints. Going with the E. coli example, suppose I have two entries as follows:

574737    575753    -
4056471   4056555   +

Using the first endpoint, I would construct queries like these, where feature can be any supported type (CDS, ncRNA, gene, etc.):

http://rest.ensemblgenomes.org/overlap/region/escherichia_coli_str_k_12_substr_mg1655/U00096:574737..575753?feature=gene;content-type=application/json

http://rest.ensemblgenomes.org/overlap/region/escherichia_coli_str_k_12_substr_mg1655/U00096:4056471..4056555?feature=gene;content-type=application/json

The first query correctly returns the gene name (insH1) and other alternative names, which is really great. The second query does not return anything, probably due to a lack of pre-existing annotation, which is a setback. I guess my option for the entries that do not return an annotation is to run a blastx against UniProt/Swiss-Prot and take the GO terms associated with any high-scoring matches.
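
For reference, the wrapper idea above can be sketched in Python roughly as follows. This is only a minimal sketch: the host and URL layout are copied from the two queries above and may have changed since, the function names are hypothetical, and the `external_name` and `id` fields are my assumption about what the returned JSON records contain. An empty result list corresponds to the second case, where no annotated gene overlaps the region.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Host as quoted in this thread; it is an assumption that it is still valid.
BASE_URL = "http://rest.ensemblgenomes.org"


def parse_coordinate_line(line):
    """Split a 'start  end  strand' line into (int, int, str)."""
    start, end, strand = line.split()
    return int(start), int(end), strand


def overlap_url(species, accession, start, end, feature="gene"):
    """Build an overlap/region query URL in the same form as the queries above."""
    region = f"{accession}:{start}..{end}"
    return (
        f"{BASE_URL}/overlap/region/{quote(species)}/{region}"
        f"?feature={feature};content-type=application/json"
    )


def gene_names(features):
    """Pick a display name per feature record.

    Assumes each record is a dict that may carry 'external_name' and 'id';
    falls back to 'id' when no name is present.
    """
    return [f.get("external_name") or f.get("id") for f in features]


def fetch_gene_names(species, accession, start, end):
    """Query the service and return names; an empty list means no overlap found."""
    with urlopen(overlap_url(species, accession, start, end)) as response:
        return gene_names(json.load(response))


if __name__ == "__main__":
    species = "escherichia_coli_str_k_12_substr_mg1655"
    for line in ["574737    575753    -", "4056471   4056555   +"]:
        start, end, strand = parse_coordinate_line(line)
        # Only the URL is printed here; call fetch_gene_names() to hit the network.
        print(strand, overlap_url(species, "U00096", start, end))
```

Regions that come back empty can then be collected into a separate list for the blastx fallback.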

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by bioslayer ▴ 50