I have downloaded a public dataset that contains only gene names (the species is mouse).
Because I want to run some downstream analyses, I decided to use biomaRt to retrieve the attributes I need in my final files: the Ensembl ID, the Entrez (NCBI) ID, the biotype, the description and the external synonym.
I use this code:
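library(biomaRt)
# Connect to the Ensembl mouse gene dataset
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# Retrieve the annotation for every gene in the dataset
BiomartInfo <- getBM(
  attributes = c("external_gene_name", "ensembl_gene_id", "entrezgene_id",
                 "gene_biotype", "description", "external_synonym"),
  mart = mart
)
# Left join that keeps every row of data; merge() joins on the column names the two tables share
mydata <- merge(data, BiomartInfo, all.x = TRUE)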
While checking the list of genes, I found one that illustrates the problem behind this post. The gene is Cyhr1: no Ensembl ID or Entrez ID is associated with it. However, when the "external_synonym" attribute is included in the query, Cyhr1 appears in that column (it was the previous name for this gene).
Looking at the rows where Cyhr1 appears as a synonym, this gene has 3 different Ensembl IDs and Entrez IDs, and only one of them has an MGI ID.
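To make the situation concrete, here is a minimal sketch of the two-step lookup described above: first match the names against the current symbols, then look up the leftovers among the synonyms. It runs on toy tables rather than on the real biomaRt result. Every symbol and ID in it is invented, and the `gene` column name for the downloaded dataset is an assumption.

```r
# Toy stand-ins for the real tables; every symbol and ID here is invented.
data <- data.frame(gene = c("GeneA", "OldB"), stringsAsFactors = FALSE)
annot <- data.frame(
  external_gene_name = c("GeneA", "GeneB", "GeneB2"),
  ensembl_gene_id    = c("ENS_A", "ENS_B1", "ENS_B2"),
  external_synonym   = c("", "OldB", "OldB"),
  stringsAsFactors   = FALSE
)

# Pass 1: names that are still current symbols.
is_current <- data$gene %in% annot$external_gene_name
hit1 <- annot[annot$external_gene_name %in% data$gene[is_current], ]
by_name <- data.frame(
  gene            = hit1$external_gene_name,
  current_symbol  = hit1$external_gene_name,
  ensembl_gene_id = hit1$ensembl_gene_id,
  matched_on      = rep("symbol", nrow(hit1)),
  stringsAsFactors = FALSE
)

# Pass 2: the remaining names, looked up among the synonyms.
hit2 <- annot[annot$external_synonym %in% data$gene[!is_current], ]
by_syn <- data.frame(
  gene            = hit2$external_synonym,
  current_symbol  = hit2$external_gene_name,
  ensembl_gene_id = hit2$ensembl_gene_id,
  matched_on      = rep("synonym", nrow(hit2)),
  stringsAsFactors = FALSE
)

# One row per (input name, Ensembl ID) pair, with a note on how it matched.
resolved <- unique(rbind(by_name, by_syn))
print(resolved)
```

In this toy case the old name "OldB" resolves to two different Ensembl IDs, which is the same kind of ambiguity I see with Cyhr1.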
I am aware that getting several different IDs for one gene name is not a new problem. There are multiple posts about it, along with recommendations to avoid gene names in the future and discussions of the limitations of the annotation databases. However, I have not found anything related to MGI IDs/symbols.
In the human field, it is recommended to use the HGNC ID to ensure stability. The HGNC is responsible for approving unique symbols and names of human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication.
For genes in non-human vertebrates, this paper advises using symbols approved by the relevant species-specific nomenclature committees (e.g., MGI for mouse) or by the Vertebrate Gene Nomenclature Committee (VGNC).
Therefore, my question is: when a gene name maps to several different IDs, should we follow the committees and, in the example above, keep only the Ensembl/Entrez IDs of the record that has an MGI symbol/ID?
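For reference, this is roughly what that filtering would look like in practice. It is only a sketch on a toy table, not a claim that this is the right rule. The IDs are invented, and the `query_name` and `mgi_id` columns are assumptions (the MGI column would have to come from an extra attribute added to the annotation query).

```r
# Toy table: one queried name that resolves to three candidate records.
# All IDs are invented; only the first record carries an MGI ID.
hits <- data.frame(
  query_name      = rep("OldB", 3),
  ensembl_gene_id = c("ENS_1", "ENS_2", "ENS_3"),
  entrezgene_id   = c(101L, 102L, 103L),
  mgi_id          = c("MGI:0000001", NA, ""),
  stringsAsFactors = FALSE
)

# A record counts as committee-approved here if its MGI ID is neither NA nor empty.
has_mgi <- !is.na(hits$mgi_id) & nzchar(hits$mgi_id)

# Number of MGI-backed records per queried name.
n_mgi <- tapply(has_mgi, hits$query_name, sum)

# Keep only the MGI-backed records and flag whether the name is now unambiguous.
kept <- hits[has_mgi, ]
kept$unique_after_filter <- as.vector(n_mgi[kept$query_name]) == 1
print(kept)
```

The `unique_after_filter` flag is there because a name could still map to more than one MGI-backed record, in which case the filter alone would not resolve it.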
Any feedback would be very welcome.
Thanks in advance.