Question

Strategy for mapping rat and human gene symbols from scRNA-seq data - dealing with inconsistent nomenclature

11 weeks ago

dxj294 • 0

I have two scRNA-seq datasets that I'm trying to integrate:

Rat data:

24k genes from cellranger pipeline using NCBI reference
Gene symbols like: Gata3, Alx4, Tcf15, Mt-cox1

Human data:

44k genes from cellranger pipeline using Ensembl reference
Gene symbols like: GATA3, ALX4, TCF15, MT-CO1

Problem:

Most differences are just letter case (GATA3 vs Gata3)
Some genes have different nomenclature (Mt-cox1 vs MT-CO1)
Using Ensembl BioMart directly only gives ~14k matches when I should get ~19k
Need to find shared gene space for integration

Current approach I'm considering:

Get Entrez IDs for rat genes
Get Entrez IDs for human genes
Map human Entrez IDs to their Ensembl gene symbols

Questions:

Is there a better way to find orthologs between these datasets?
Are there existing tables/resources that map these nomenclature differences?
I am unable to find tables that map Entrez IDs between rat and human. Could anyone point me to those?

Any suggestions or alternative approaches would be appreciated.

Orthologs scRNAseq • 643 views

updated 11 weeks ago by ATpoint 87k • written 11 weeks ago by dxj294 • 0

score 2 · Answer 1 · 2025-01-15

"Need to find shared gene space for integration"

The shared space is the set of orthologs annotated by a reputable source such as Ensembl, which you already mentioned, queried e.g. via biomaRt. There is not always a single ortholog per gene: sometimes there is none, and with gene duplications and divergence there may be several human orthologs for one rat gene. Hence, my recommendation is:

Query biomaRt for the ortholog table between the two species. Inner-join this table with the two datasets. Analyze the intersection.
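To make the join step concrete, here is a minimal Python sketch of the same idea, assuming the ortholog table has already been exported (for example from biomaRt) as pairs of rat and human symbols. The pairs below are toy rows, not a real export: `RatDupA`, `HUMDUPA1` and `HUMDUPA2` are made-up names for a one-to-many case, and `Tcf15` is deliberately left out of the toy table to show a gene dropping out of the intersection. Restricting to one-to-one pairs is an optional extra step, not something the recommendation above requires.

```python
from collections import Counter

# Toy ortholog pairs (rat_symbol, human_symbol) standing in for a real export.
# All rows are illustrative; the "Dup" names are hypothetical.
ortholog_pairs = [
    ("Gata3", "GATA3"),
    ("Alx4", "ALX4"),
    ("Mt-cox1", "MT-CO1"),
    ("RatDupA", "HUMDUPA1"),
    ("RatDupA", "HUMDUPA2"),
]

# Gene symbols present in each expression matrix (toy subsets).
rat_genes = {"Gata3", "Alx4", "Tcf15", "Mt-cox1", "RatDupA"}
human_genes = {"GATA3", "ALX4", "TCF15", "MT-CO1", "HUMDUPA1"}


def inner_join(pairs, rat, human):
    """Keep ortholog pairs whose rat AND human symbols occur in the datasets."""
    return [(r, h) for r, h in pairs if r in rat and h in human]


def one_to_one(pairs):
    """Keep pairs where each symbol appears exactly once in the ortholog table."""
    rat_counts = Counter(r for r, _ in pairs)
    human_counts = Counter(h for _, h in pairs)
    return [(r, h) for r, h in pairs
            if rat_counts[r] == 1 and human_counts[h] == 1]


# Intersection of the ortholog table with both datasets.
shared = inner_join(ortholog_pairs, rat_genes, human_genes)

# Same intersection, restricted to unambiguous one-to-one orthologs.
shared_1to1 = inner_join(one_to_one(ortholog_pairs), rat_genes, human_genes)

print(len(shared), "pairs in the intersection;",
      len(shared_1to1), "of them are one-to-one")
```

With real data it is probably safer to join on stable gene IDs rather than symbols where both references provide them, since symbols are exactly where the two annotations disagree.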

score 1 · Answer 2 · 2025-01-15

If the differences are only in letter case, map all characters to upper or lower case using a tool like sed or tr, or your own Python script.
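As a rough illustration of the case-normalization idea in Python, the sketch below upper-cases the rat symbols and checks which ones then match a human symbol exactly. It uses only the example symbols from the question. A case-insensitive match is only a heuristic and does not by itself establish orthology, and symbols with genuinely different nomenclature (such as Mt-cox1 vs MT-CO1) remain unmatched.

```python
# Example symbols taken from the question.
rat_symbols = ["Gata3", "Alx4", "Tcf15", "Mt-cox1"]
human_symbols = ["GATA3", "ALX4", "TCF15", "MT-CO1"]

# Upper-cased rat symbol -> original rat symbol.
rat_by_upper = {symbol.upper(): symbol for symbol in rat_symbols}
human_set = set(human_symbols)

# Rat symbols whose upper-cased form is also a human symbol.
matched = {rat: upper for upper, rat in rat_by_upper.items()
           if upper in human_set}

# Rat symbols that still need a proper ortholog lookup.
unmatched = [rat for upper, rat in rat_by_upper.items()
             if upper not in human_set]

print("matched:", matched)
print("unmatched:", unmatched)
```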

I would get the cDNA or protein sequences and map the various files together using Proteinortho. You'll get a nice TSV summary output of common and unique identifiers.

Then you can use csvtk join https://github.com/shenwei356/csvtk?tab=readme-ov-file to compare the groups of identifiers to further lists.

Or just add all relevant symbols to the identifiers in the FASTA files before mapping.
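A small Python sketch of that last option, assuming you already have a mapping from sequence identifier to gene symbol. The function name `add_symbols_to_fasta`, the `|` separator, and the IDs `seq1` and `seq2` are all hypothetical choices for illustration. Whether a downstream tool keeps the whole modified identifier depends on how it parses FASTA headers, so check that before relying on this.

```python
def add_symbols_to_fasta(lines, id_to_symbol, sep="|"):
    """Append a gene symbol to the identifier of each FASTA header line.

    `lines` is an iterable of FASTA lines; `id_to_symbol` maps the first
    whitespace-delimited token of a header to a symbol. Headers without a
    known symbol are passed through unchanged.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header into identifier and optional description.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            symbol = id_to_symbol.get(seq_id)
            if symbol:
                seq_id = f"{seq_id}{sep}{symbol}"
            out.append(f">{seq_id} {description}".rstrip())
        else:
            # Sequence lines are kept as they are.
            out.append(line)
    return out


# Toy FASTA records with hypothetical identifiers.
fasta_lines = [
    ">seq1 some description\n",
    "ACGT\n",
    ">seq2\n",
    "TTGA\n",
]
annotated = add_symbols_to_fasta(fasta_lines, {"seq1": "Gata3"})
print("\n".join(annotated))
```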