Hello all,
I am building a pipeline to run searches locally against the latest Ensembl Bacteria release. However, while trying to parse the data to pull in more information from other databases, I've come across an issue.
Since release 56, all stable_id values are set to a unique 15-character value based on a checksum performed against various parameters specific to that sequence and species (described more here: https://www.ebi.ac.uk/about/news/updates-from-data-resources/ensembl-bacteria/). These ids look something like this: ENSB:sKExEba7twfpOpI.
My problem is that there appear to be no cross-references anywhere, either on the Ensembl website or elsewhere on the internet, to other unique identifiers (such as NCBI, UniProt, etc.). The release also seems to have removed all references to the old Ensembl IDs.
The best strategy I can think of is quite long-winded. Take the TSV files from the Ensembl release; these contain a cross-reference to the GenBank ID for the whole contig that the gene/protein is on (e.g. LMIU01000024.1). Then access that GenBank entry or download it from the NCBI website, parse the file, and extract the protein_id and locus_tag, which I can use to query UniProt and get the UniProt accession (if it differs).
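For what it's worth, the parsing step of that strategy could be sketched roughly as below. This is only a minimal sketch using the standard library: it assumes the contig has already been saved as GenBank flat-file text with the usual column layout (feature keys indented 5 spaces, qualifiers indented 21 spaces), and the function name `cds_ids` is made up for illustration. A proper parser such as Biopython's would be more robust.

```python
import re

def cds_ids(genbank_text):
    """Return (locus_tag, protein_id) pairs for the CDS features in GenBank flat-file text.

    Assumption: standard flat-file layout, with feature keys starting in
    column 6 and qualifiers in column 22. Missing qualifiers come back as None.
    """
    found = []
    current = None        # qualifiers of the CDS being read, None outside a CDS
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            # A new top-level section (ORIGIN, CONTIG, //) ends the feature table.
            in_features = False
            if current is not None:
                found.append(current)
                current = None
            continue
        key = re.match(r"^ {5}(\S+)\s+\S", line)
        if key:
            # A new feature starts; store the previous CDS, if there was one.
            if current is not None:
                found.append(current)
            current = {} if key.group(1) == "CDS" else None
            continue
        if current is not None:
            qual = re.match(r'^ {21}/(locus_tag|protein_id)="([^"]*)"', line)
            if qual:
                current.setdefault(qual.group(1), qual.group(2))
    if current is not None:
        found.append(current)
    return [(c.get("locus_tag"), c.get("protein_id")) for c in found]
```

Calling `cds_ids(open("contig.gb").read())` (the file name is a placeholder) would give a list of pairs that could then be used for the UniProt lookup.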
This seems super complicated though. Am I missing something much easier?
Kind regards,
Tagging: Ben_Ensembl
Ensembl has to be the source? You could start with the genomes from GenBank.
This is true. I liked the quality inclusion criteria of Ensembl Bacteria, which is why I began with them (and their FTP site). But I suppose I could just take the accession numbers from their genome list and get the genomes from GenBank instead. Now that I've started writing scripts and building the pipeline with genomes sourced from Ensembl, it would be nice to continue, but I could take your approach if nothing simpler comes along.
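If I do go the GenBank route, fetching each record by accession could look something like the sketch below. It only builds the request URL for the NCBI E-utilities `efetch` endpoint; the helper name `genbank_url` is hypothetical, and I am assuming `rettype=gb` returns the annotated features for these contigs (larger records may need `gbwithparts` instead).

```python
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def genbank_url(accession):
    """Build an efetch URL for one nucleotide accession as GenBank flat-file text."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # e.g. a contig accession from the Ensembl TSV files
        "rettype": "gb",     # assumption: plain GenBank format is enough here
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)
```

The URL for a contig such as LMIU01000024.1 can then be downloaded with `urllib.request.urlopen` or any HTTP client, keeping the request rate within NCBI's usage limits.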