I am currently working on Ensembl data, and I read about the stable gene identifiers: https://www.ensembl.org/info/genome/stable_ids/index.html
Apparently, gene IDs are stable between releases, but gene names are not necessarily. I got curious and wanted to know how many gene names change between consecutive versions.
I wrote a small R script that connects to the different versions of Ensembl and finds the stable IDs that have different gene names between two consecutive versions (see the end of the post; the script needs biomaRt and ggplot2).
There is a stable rate of changes of a few hundred gene names, but a very high peak between versions 88 and 89. Most of it comes from gene names changing from "AC######" or "AL######" to "RP##-######", which seems to correspond to BAC clones from the "Vertebrate Genome Annotation" project (see the related post "What Are These Rp11 'Genes' In The Genome?").
If anyone could clarify this, that would be interesting. It seems that Ensembl renamed the "RP11" pseudogenes between versions 88 and 89.
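One way to check this kind of pattern is to tabulate the changed names by prefix. The snippet below is only a minimal sketch on made-up rows: the IDs and names in `toy` are hypothetical, and the regular expressions are my assumption of what the "AC######"/"AL######" and "RP##-" patterns look like. With real data, `toy` would be the table obtained by merging the gene names of versions 88 and 89 on the stable ID.

```r
# Hypothetical merged table: one row per stable ID,
# with the old name (.x) and the new name (.y).
toy <- data.frame(
  ensembl_gene_id      = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  external_gene_name.x = c("AC000001.1", "AL000002.2", "GENE1", "GENE2"),
  external_gene_name.y = c("RP11-1A1.1", "RP4-2B2.2", "GENE1", "GENE2B"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose name changed.
changed <- toy[toy$external_gene_name.x != toy$external_gene_name.y, ]

# Assumed patterns: clone-accession-like names and "RP<number>-" names.
is_clone_acc <- grepl("^A[CL][0-9]{6}", changed$external_gene_name.x)
is_rp_name   <- grepl("^RP[0-9]+-", changed$external_gene_name.y)

# Cross-tabulate the old-name pattern against the new-name pattern.
pattern_table <- table(old_is_AC_AL = is_clone_acc, new_is_RP = is_rp_name)
n_clone_to_rp <- sum(is_clone_acc & is_rp_name)
print(pattern_table)
```

If most of the changed rows fall into the cell where both patterns match, that would support the idea that the peak is one systematic renaming rather than many unrelated changes.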
I have not worked a lot with Ensembl data, so I may have made mistakes. Let me know if you have any suggestions, comments or remarks about this small script or the results.
# This script is designed to plot the number of gene name changes between the
# different Ensembl versions
# This will store the changes between a version and the one before
changes_between_versions <- data.frame(version = c(), changes = c())
prev_version = 76
for(version in c(77:96)){
  # Connect to two consecutive Ensembl versions
  prev_ensembl <- biomaRt::useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL",
                                      host = "www.ensembl.org",
                                      dataset = "hsapiens_gene_ensembl",
                                      version = prev_version)
  ensembl <- biomaRt::useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL",
                                 host = "www.ensembl.org",
                                 dataset = "hsapiens_gene_ensembl",
                                 version = version)
  # Get the gene names and IDs for the two consecutive Ensembl versions
  prev_gene_names <- biomaRt::getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
                                    mart = prev_ensembl, uniqueRows = TRUE)
  gene_names <- biomaRt::getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
                               mart = ensembl, uniqueRows = TRUE)
  # Merge the genes by Ensembl stable ID (inner join: IDs present in both versions)
  name_changes <- merge(prev_gene_names, gene_names, by = "ensembl_gene_id",
                        all.x = FALSE, all.y = FALSE)
  # Each row contains one stable ID; if the two gene names differ, the name
  # changed between the two versions. sum() gives 0 (not NA) when nothing changed.
  changes <- sum(name_changes$external_gene_name.x != name_changes$external_gene_name.y,
                 na.rm = TRUE)
  changes_between_versions <- rbind(changes_between_versions,
                                    data.frame(version = paste0(prev_version, "-", version),
                                               changes = changes))
  prev_version <- version
}
# Plot the results (every ggplot2 function is namespace-qualified, so the
# script also runs without library(ggplot2))
ggplot2::ggplot(data = changes_between_versions,
                ggplot2::aes(x = version, y = changes, group = "ensembl_versions")) +
  ggplot2::geom_line(color = "#4682B4") +
  ggplot2::geom_point(color = "#4682B4") +
  ggplot2::theme_bw() +
  ggplot2::xlab("Ensembl version") +
  ggplot2::ylab("# of name changes with previous version")
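The counting step can also be tried without any connection to Ensembl. The following is a rough sketch, not part of the script above: `count_name_changes` is a helper name I made up, and `prev` and `curr` are hypothetical toy tables that only mimic the two columns returned by `getBM()`.

```r
# Count the stable IDs whose gene name differs between two releases.
# `prev` and `curr` need the columns ensembl_gene_id and external_gene_name.
count_name_changes <- function(prev, curr) {
  # Inner join on the stable ID: only IDs present in both releases are compared.
  merged <- merge(prev, curr, by = "ensembl_gene_id",
                  suffixes = c("_prev", "_curr"))
  # Summing the logical comparison returns 0 when no name changed.
  sum(merged$external_gene_name_prev != merged$external_gene_name_curr,
      na.rm = TRUE)
}

# Hypothetical toy tables (not real Ensembl records).
prev <- data.frame(ensembl_gene_id = c("id1", "id2", "id3"),
                   external_gene_name = c("A", "B", "C"),
                   stringsAsFactors = FALSE)
curr <- data.frame(ensembl_gene_id = c("id1", "id2", "id4"),
                   external_gene_name = c("A", "B2", "D"),
                   stringsAsFactors = FALSE)

n_changed   <- count_name_changes(prev, curr)  # only id2 is shared and renamed
n_unchanged <- count_name_changes(prev, prev)  # identical tables, nothing changed
```

Note that IDs present in only one of the two releases (here `id3` and `id4`) are dropped by the inner join, so retired or newly created genes are not counted as name changes.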