Question

Ensembl stable ID cross-references

19 months ago

J • 0

Hello all,

I am building a pipeline to perform searches on the latest ensembl bacteria release locally. However, in trying to parse the data to get more information from other databases I've come across an issue.

Since release 56, all stable_id values are set to a unique 15-character value based on a checksum performed against various parameters specific to that sequence and species (described more here: https://www.ebi.ac.uk/about/news/updates-from-data-resources/ensembl-bacteria/). These ids look something like this: ENSB:sKExEba7twfpOpI.

My problem is that there appear to be no cross-references, on the Ensembl website or anywhere else on the internet, to other unique identifiers (NCBI, UniProt, etc.). The release also seems to have removed all references to the old Ensembl IDs.

The best strategy I can think of now is quite long-winded. Take the TSV files from the Ensembl release; these contain a cross-reference to the GenBank ID for the whole contig that the gene/protein is on (e.g. LMIU01000024.1). Then I need to access that GenBank entry or download it from the NCBI website. From there I can parse the file and extract the protein_id and locus_tag, which I can use to query UniProt and get the UniProt accession (if it differs).
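The GenBank half of that workflow could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the function names `efetch_url` and `parse_cds_ids` are made up for illustration, the URL is built against NCBI's E-utilities efetch endpoint with `rettype=gbwithparts` (plain `gb` may be enough for a single contig), and the parser is a simplified standard-library reader of the GenBank feature table rather than a full parser (something like Biopython would be more robust). Reading the contig accession out of the Ensembl TSV is left out because the column layout is not shown here.

```python
import re
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Matches qualifier lines such as:   /locus_tag="ABC_0001"
QUALIFIER = re.compile(r'^\s+/(locus_tag|protein_id)="([^"]+)"')


def efetch_url(accession):
    """Build a URL that returns the GenBank flat file for one contig accession."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "gbwithparts",  # assumption: plain "gb" may also work
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def parse_cds_ids(genbank_text):
    """Return one dict per CDS feature with its locus_tag and protein_id.

    Simplified: only looks at the FEATURES table and only at these two
    qualifiers; a missing qualifier is left as None.
    """
    records = []
    current = None
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line[0].isspace():
            # Next top-level section (ORIGIN, CONTIG, //) ends the table.
            break
        # Feature keys start in column 6, after exactly five spaces.
        if len(line) > 5 and line[:5] == "     " and not line[5].isspace():
            key = line.split()[0]
            current = None
            if key == "CDS":
                current = {"locus_tag": None, "protein_id": None}
                records.append(current)
            continue
        match = QUALIFIER.match(line)
        if match and current is not None:
            current[match.group(1)] = match.group(2)
    return records
```

The text downloaded from `efetch_url("LMIU01000024.1")` would be passed to `parse_cds_ids`, and the resulting protein_id values could then be looked up in UniProt, for example through its ID mapping service.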

This seems super complicated though. Am I missing something much easier?

Kind regards,

database parsing cross-reference ensembl • 1.5k views

updated 19 months ago by Ben Moore ★ 2.4k • written 19 months ago by J • 0

Tagging: Ben_Ensembl

19 months ago by GenoMax 154k

"I am building a pipeline to perform searches on the latest ensembl bacteria release locally."

Ensembl has to be the source? You could start with the genomes from GenBank.

19 months ago by GenoMax 154k

This is true. I liked the quality inclusion criteria of Ensembl Bacteria, which is why I began with them (and their FTP site). But I suppose I could just take the accession numbers from their genome list and get the genomes from GenBank instead. Now that I've started writing scripts and building the pipeline around genomes sourced from Ensembl, it would be nice to continue, but I could take your approach if nothing simpler comes along.

19 months ago by J • 0

score 0 · Answer 1 · 2024-04-08

Hi J - as you pointed out, there were some recent changes to the bacterial genome annotation in Ensembl outlined in the blog you referenced and here: https://www.ensembl.info/2023/07/24/ensembl-bacteria-releases-new-annotation-for-all-of-its-genomes/#:~:text=A%20common%20annotation%20pipeline%20has,microbial%20groups%20at%20EMBL%2DEBI.

So, yes, using the TSV file as you suggested is currently the best way to retrieve these cross references. However, we are currently working to add UniProt identifiers, which will be included in an upcoming release.