Strategy for mapping rat and human gene symbols from scRNA-seq data - dealing with inconsistent nomenclature
11 weeks ago
dxj294 • 0

I have two scRNA-seq datasets that I'm trying to integrate:

  1. Rat data:
  • 24k genes from cellranger pipeline using NCBI reference
  • Gene symbols like: Gata3, Alx4, Tcf15, Mt-cox1
  2. Human data:
  • 44k genes from cellranger pipeline using Ensembl reference
  • Gene symbols like: GATA3, ALX4, TCF15, MT-CO1

Problem:

  • Most differences are only in letter case (GATA3 vs Gata3)
  • Some genes have different nomenclature (Mt-cox1 vs MT-CO1)
  • Using Ensembl BioMart directly only gives ~14k matches when I should get ~19k
  • Need to find shared gene space for integration

Current approach I'm considering:

  1. Get Entrez IDs for rat genes
  2. Get Entrez IDs for human genes
  3. Map human Entrez IDs to their Ensembl gene symbols

Questions:

  1. Is there a better way to find orthologs between these datasets?
  2. Are there existing tables/resources that map these nomenclature differences?
  3. I can't find rat-to-human mapping tables keyed by Entrez ID. Could anyone point me to one?

Any suggestions or alternative approaches would be appreciated.

Orthologs scRNAseq • 643 views
11 weeks ago
ATpoint 87k

Need to find shared gene space for integration

The shared space is the set of orthologs annotated by a reputable source such as Ensembl, which you mention, and which can be queried e.g. via biomaRt. A gene does not always have exactly one ortholog: sometimes there is none, and after gene duplication and divergence there may be several for one rat gene. Hence, my recommendation is:

Query biomaRt for the ortholog table between the two species. Inner-join that table with both datasets, then analyze the intersection.
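The join described above can be sketched in plain Python. This is a minimal illustration, not the biomaRt workflow itself: it assumes the ortholog table has already been exported as (rat gene, human gene) pairs, and it additionally keeps only one-to-one pairs, which is one possible way to handle the "none or many" cases and not something the answer prescribes. The function names and the toy genes RatDupA, RatOnly1, HUMDUP1, HUMDUP2 and HUMX are made up for the example. A real table would more likely be keyed by stable gene IDs than by symbols.

```python
from collections import defaultdict


def one_to_one_pairs(ortholog_pairs):
    """Keep only (rat, human) pairs in which both genes occur in exactly one pair."""
    pairs = set(ortholog_pairs)
    rat_to_human = defaultdict(set)
    human_to_rat = defaultdict(set)
    for rat, human in pairs:
        rat_to_human[rat].add(human)
        human_to_rat[human].add(rat)
    return sorted(
        (rat, human)
        for rat, human in pairs
        if len(rat_to_human[rat]) == 1 and len(human_to_rat[human]) == 1
    )


def shared_gene_space(ortholog_pairs, rat_genes, human_genes):
    """Inner join: one-to-one ortholog pairs whose genes are present in both datasets."""
    rat_genes, human_genes = set(rat_genes), set(human_genes)
    return [
        (rat, human)
        for rat, human in one_to_one_pairs(ortholog_pairs)
        if rat in rat_genes and human in human_genes
    ]


# Toy ortholog table (hypothetical apart from the symbols quoted in the question).
orthologs = [
    ("Gata3", "GATA3"),
    ("Alx4", "ALX4"),
    ("Tcf15", "TCF15"),
    ("RatDupA", "HUMDUP1"),  # one rat gene with two human orthologs
    ("RatDupA", "HUMDUP2"),
    ("RatOnly1", "HUMX"),    # pair whose rat gene is not in the rat dataset
]
rat_dataset = ["Gata3", "Alx4", "Tcf15", "RatDupA"]
human_dataset = ["GATA3", "ALX4", "HUMDUP1", "HUMDUP2"]

shared = shared_gene_space(orthologs, rat_dataset, human_dataset)
print(shared)
```

In this toy run only the Alx4 and Gata3 pairs survive: Tcf15 is dropped because TCF15 is absent from the human dataset, and RatDupA is dropped because it is not one-to-one.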

11 weeks ago

If the differences are only in letter case, convert all symbols to upper or lower case with a tool such as sed or tr, or with your own Python script.
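A small Python version of that idea might look like the sketch below. It matches symbols case-insensitively and lets a hand-curated alias table cover names that differ by more than case, using the Mt-cox1 / MT-CO1 pair from the question. The function name and the alias table are assumptions made for illustration, and an identical symbol is not by itself proof of orthology, so treat the result as a rough first pass.

```python
def match_symbols(rat_symbols, human_symbols, aliases=None):
    """Map rat symbols to human symbols, ignoring case.

    `aliases` maps a rat symbol to the human symbol it should be treated as
    (a hand-curated table for names that differ by more than case).
    """
    aliases = aliases or {}
    # Index human symbols by their upper-case form; keep the first spelling seen.
    human_by_upper = {}
    for symbol in human_symbols:
        human_by_upper.setdefault(symbol.upper(), symbol)

    matched = {}
    for symbol in rat_symbols:
        key = aliases.get(symbol, symbol).upper()
        if key in human_by_upper:
            matched[symbol] = human_by_upper[key]
    return matched


# Alias table built only from the example given in the question.
ALIASES = {"Mt-cox1": "MT-CO1"}

rat = ["Gata3", "Alx4", "Tcf15", "Mt-cox1"]
human = ["GATA3", "ALX4", "TCF15", "MT-CO1"]

matched = match_symbols(rat, human, ALIASES)
for rat_symbol, human_symbol in matched.items():
    print(rat_symbol, "->", human_symbol)
```

Without the alias entry, Mt-cox1 would stay unmatched, which makes it easy to list the leftovers and decide case by case whether they need an alias or a proper ortholog lookup.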

I would get the cDNA or protein sequences and compare the files against each other with Proteinortho. You'll get a TSV summary of shared and unique identifiers.

Then you can use csvtk join (https://github.com/shenwei356/csvtk?tab=readme-ov-file) to compare the groups of identifiers against further lists.

Or just add all relevant symbols to the identifiers in the FASTA files before mapping.
