What is the stability of Ensembl IDs at the gene, protein and transcript levels?
I know human is pretty solid (but still not completely 1:1:1:1 with HGNC, Swiss-Prot and Entrez Gene) and there is a push to close mouse and zebrafish via Havana.
However, I guess most of the other assemblies are provisional to some extent, so the gene IDs are difficult to lock down across rebuilds, which will also cause some churn in the transcripts and ORFs.
Is there any data on this? The reason for asking is with regard to citing them and/or including raw sequence as supplementary data.
We may be able to dig up some stats from when we redid genebuilds, but these will be patchy. The genebuild pipeline produces a stats file for each build, but we don't keep these as a matter of course. It depends on whether the individual who ran the genebuild decided to keep them. Which species are you interested in?
Thanks for the offer, but all things considered, for the publication we are working on (a couple of genes in ~20 species) I have decided to (ask the eds if we can) supply the FASTA strings of what we used as supplementary data. This simplifies everything for re-use and, besides, we extended many Ensembl ORFs using ESTs or TSAs. JFTR, has the team ever looked at "churn" rates, i.e. changes in the ORF sets between gene builds?
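For anyone wanting a rough number in the meantime, the "churn" between two builds can be estimated by comparing the protein FASTA dumps of each build. The following is only a minimal sketch, not anything the Ensembl team uses: the function names `parse_fasta` and `churn`, the example IDs and the sequences are all made up for illustration, and it assumes the stable ID is the first token of each FASTA header with an optional `.version` suffix.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each record ID (version suffix removed) to its sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first header token, e.g. "ID.2 extra text".
            current_id = line[1:].split()[0].split(".")[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def churn(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, int]:
    """Count IDs that were retired, added, kept unchanged, or kept with a new sequence."""
    old_ids, new_ids = set(old), set(new)
    shared = old_ids & new_ids
    changed = {i for i in shared if old[i] != new[i]}
    return {
        "retired": len(old_ids - new_ids),
        "added": len(new_ids - old_ids),
        "unchanged": len(shared - changed),
        "changed": len(changed),
    }


# Hypothetical IDs and sequences, purely for illustration.
old_build = """>PEP0001.1
MKTAYIAK
>PEP0002.1
MAAGLL
>PEP0003.1
MLLPQR
"""
new_build = """>PEP0001.1
MKTAYIAK
>PEP0002.2
MAAGLLSV
>PEP0004.1
MPPWTE
"""

result = churn(parse_fasta(old_build), parse_fasta(new_build))
print(result)
```

In practice you would read the two files from disk rather than from inline strings. Note that this only tracks churn by ID, so an ORF that was retired and re-annotated under a new ID with the same sequence counts as one retired plus one added.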