I have two scRNA-seq datasets that I'm trying to integrate:
- Rat data:
  - 24k genes from the cellranger pipeline using an NCBI reference
  - Gene symbols like: Gata3, Alx4, Tcf15, Mt-cox1
- Human data:
  - 44k genes from the cellranger pipeline using an Ensembl reference
  - Gene symbols like: GATA3, ALX4, TCF15, MT-CO1
Problem:
- Most differences are only in letter case (GATA3 vs Gata3)
- Some genes have different nomenclature (Mt-cox1 vs MT-CO1)
- Querying Ensembl BioMart directly gives only ~14k matches, when I expected ~19k
- Need to find shared gene space for integration
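
To illustrate the problem, here is a minimal sketch of naive case-insensitive symbol matching, using only the example symbols listed above. The function name `shared_by_case` is hypothetical and this is not part of any pipeline; it only shows that case folding recovers GATA3/Gata3 but misses genes with different nomenclature such as Mt-cox1/MT-CO1.

```python
def shared_by_case(rat_symbols, human_symbols):
    """Map each human symbol to the rat symbol that matches it when case is ignored."""
    # Index rat symbols by their upper-cased form.
    rat_by_upper = {symbol.upper(): symbol for symbol in rat_symbols}
    return {
        human: rat_by_upper[human.upper()]
        for human in human_symbols
        if human.upper() in rat_by_upper
    }


# Example symbols taken from the two datasets described above.
rat = ["Gata3", "Alx4", "Tcf15", "Mt-cox1"]
human = ["GATA3", "ALX4", "TCF15", "MT-CO1"]

matched = shared_by_case(rat, human)
# Rat symbols that found no human partner by case folding alone.
unmatched_rat = sorted(set(rat) - set(matched.values()))

print(matched)
print(unmatched_rat)
```

Here the mitochondrial gene stays unmatched, which is why matching on symbols alone is not enough.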
Current approach I'm considering:
- Get Entrez IDs for rat genes
- Get Entrez IDs for human genes
- Map human Entrez IDs back to the gene symbols used in the Ensembl reference
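
The steps above could be sketched roughly as follows, assuming a rat-human ortholog table keyed by gene ID is available (which is exactly the table I have not found). All IDs below are made-up placeholders, not real Entrez IDs, `GeneX` and the `PLACEHOLDER_*` symbols are invented for illustration, and `map_rat_to_human` is a hypothetical helper. Keeping only one-to-one ortholog pairs is my own assumption, not something I know to be required.

```python
from collections import defaultdict


def map_rat_to_human(rat_symbol_to_id, ortholog_pairs, human_id_to_symbol):
    """Return {rat symbol: human symbol} for one-to-one ortholog pairs only."""
    rat_to_human = defaultdict(set)
    human_to_rat = defaultdict(set)
    for rat_id, human_id in ortholog_pairs:
        rat_to_human[rat_id].add(human_id)
        human_to_rat[human_id].add(rat_id)

    mapping = {}
    for rat_symbol, rat_id in rat_symbol_to_id.items():
        human_ids = rat_to_human.get(rat_id, set())
        if len(human_ids) != 1:
            continue  # no ortholog, or one-to-many
        (human_id,) = human_ids
        if len(human_to_rat[human_id]) != 1:
            continue  # many-to-one
        if human_id in human_id_to_symbol:
            mapping[rat_symbol] = human_id_to_symbol[human_id]
    return mapping


# Placeholder IDs for illustration only; these are NOT real Entrez IDs.
rat_symbol_to_id = {"Gata3": "rat:1", "Mt-cox1": "rat:2", "GeneX": "rat:3"}
ortholog_pairs = [
    ("rat:1", "human:1"),
    ("rat:2", "human:2"),
    ("rat:3", "human:3"),  # hypothetical one-to-many case
    ("rat:3", "human:4"),
]
human_id_to_symbol = {
    "human:1": "GATA3",
    "human:2": "MT-CO1",
    "human:3": "PLACEHOLDER_A",
    "human:4": "PLACEHOLDER_B",
}

rat_to_human_symbol = map_rat_to_human(
    rat_symbol_to_id, ortholog_pairs, human_id_to_symbol
)
print(rat_to_human_symbol)
```

With a mapping like this, the rat matrix could be renamed to human symbols and both datasets subset to the shared genes before integration.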
Questions:
- Is there a better way to find orthologs between these datasets?
- Are there existing tables/resources that map these nomenclature differences?
- I am unable to find rat-to-human ortholog tables keyed by Entrez ID. Could anyone point me to one?
Any suggestions or alternative approaches would be appreciated.