Question

Two different RNA sequencing sets that have gene names set up differently - how to normalize?

4 months ago

awsk • 0

Hello,

I am trying to do spatial cell type deconvolution, so I need to bring together data from two different sets. I am working in R. I processed one set myself (let's call it set1), the spatial data, and all of the gene names in its gene matrix are consistently in symbol-geneid form, like so:

[1] "rps27a-ENSDARG00000032725"             "NC-002333.4-ENSDARG00000080337"       
[3] "CRACD-ENSDARG00000011602"              "eif5a2-ENSDARG00000056186"            
[5] "slc25a5-ENSDARG00000092553"            "myl10-ENSDARG00000062592"             
[7] "cct8-ENSDARG00000008243"               "COX5B-ENSDARG00000015978"             
[9] "rps2-ENSDARG00000077291"               "rps11-ENSDARG00000053058"

The other set (let's call it set2) was processed by another lab, and its naming is inconsistent. Sometimes it is the gene ID, sometimes the symbol or name, and sometimes both, like so:

 [323] "caprin2"                               "si:dkey-28i19.3"                      
 [325] "si:dkey-250k10.1"                      "znf1026-ENSDARG00000109604"           
 [327] "si:dkey-29p23.2"                       "si:dkeyp-53e4.4"                      
 [329] "ENSDARG00000009262"                    "si:ch211-226o13.1"

I need the names to match between the two sets so the pipeline can do its magic. Given that any value in set2 should be at least partially present in set1, I thought one approach would be a mass grep, replacing each entry in set2 with its best match from set1. My R skills are not sufficient to work that out, so I would appreciate it if anyone could help (or offer better alternatives).
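A minimal sketch of the "replace each set2 entry with its match from set1" idea, in base R. Instead of a free-form grep, it splits each name into an Ensembl ID part and a symbol part, matches on the ID first, and falls back to the symbol. The helper names (`extract_id`, `extract_symbol`) are made up for illustration, the toy vectors reuse names from this post, and the case-insensitive symbol matching is an assumption that may not be appropriate for every annotation.

```r
# Toy vectors built from the example names in this post.
set1 <- c("rps27a-ENSDARG00000032725", "CRACD-ENSDARG00000011602",
          "eif5a2-ENSDARG00000056186", "cct8-ENSDARG00000008243")
set2 <- c("cct8", "ENSDARG00000011602", "eif5a2-ENSDARG00000056186",
          "si:dkey-28i19.3")

id_pattern <- "ENSDARG[0-9]+"

# Hypothetical helper: the Ensembl gene ID in each name, or NA if absent.
extract_id <- function(x) {
  out <- rep(NA_character_, length(x))
  hit <- grepl(id_pattern, x)
  out[hit] <- regmatches(x, regexpr(id_pattern, x))
  out
}

# Hypothetical helper: the name with any trailing "-ENSDARG..." removed,
# or NA if nothing is left (the entry was a bare ID).
extract_symbol <- function(x) {
  s <- sub(paste0("-?", id_pattern, "$"), "", x)
  s[s == ""] <- NA_character_
  s
}

set1_id     <- extract_id(set1)
set1_symbol <- extract_symbol(set1)
set2_id     <- extract_id(set2)
set2_symbol <- extract_symbol(set2)

# Match on Ensembl ID where set2 has one.
idx <- match(set2_id, set1_id)
idx[is.na(set2_id)] <- NA

# Fall back to a case-insensitive symbol match (an assumption).
by_symbol <- match(tolower(set2_symbol), tolower(set1_symbol))
by_symbol[is.na(set2_symbol)] <- NA
idx[is.na(idx)] <- by_symbol[is.na(idx)]

# Same length and order as set2; NA marks entries with no match in set1.
set2_harmonized <- set1[idx]
print(data.frame(set2 = set2, set1_match = set2_harmonized))
```

Entries that come back as `NA` have no counterpart in set1 and need a separate decision (drop them, or look them up in the annotation). `match()` returns only the first hit, so duplicated symbols in set1 would be resolved silently and are worth checking beforehand.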

The other solution I tried, inspired by [this post][1], involved the mygene package in R, but unless I can get a 1:1 mapping that preserves the order of the list, it would be difficult to use.
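On the 1:1 and ordering concern: whatever lookup is used, one way to get exactly one result per query in the original order is to reduce the hits to one row per query and then align them with `match()`. The sketch below does not call mygene. The `hits` data frame, its column names, and the two placeholder IDs are invented stand-ins for a lookup result, not the actual return format of any package.

```r
# Queries in the order they appear in the gene list (repeats allowed).
queries <- c("caprin2", "si:dkey-28i19.3", "cct8", "caprin2")

# Hypothetical lookup result: one row per hit, so a query may appear
# zero, one, or several times. The placeholder IDs are not real IDs.
hits <- data.frame(
  query   = c("cct8", "caprin2", "caprin2"),
  ensembl = c("ENSDARG00000008243", "ENSDARG_PLACEHOLDER_A",
              "ENSDARG_PLACEHOLDER_B"),
  stringsAsFactors = FALSE
)

# Flag queries with more than one hit so they can be reviewed by hand.
n_hits <- table(hits$query)
ambiguous <- names(n_hits)[n_hits > 1]

# Keep the first hit per query, then align to the original query order.
first_hit <- hits[!duplicated(hits$query), ]
mapped <- first_hit$ensembl[match(queries, first_hit$query)]

result <- data.frame(
  query     = queries,
  ensembl   = mapped,                  # NA where the lookup found nothing
  ambiguous = queries %in% ambiguous,  # TRUE where the first hit was arbitrary
  stringsAsFactors = FALSE
)
print(result)
```

`result` always has one row per query, in query order, so it can be used directly to rename a gene list. Taking the first hit is an arbitrary choice, which is why the ambiguous queries are flagged rather than hidden.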

Any help on how to address this would be much appreciated, thank you.

R RNA-Sequencing • 682 views

ADD COMMENT • link 4 months ago by awsk • 0

Is there any chance you can get your hands on the raw reads? It would simplify everything and ensure as much consistency as possible between the two sets. I would be worried about using a dataset with what looks like an inconsistent naming convention. Is the transcriptome they mapped to the same as the one used for set1? If not, then the comparison isn't exactly fair and could lead to some spurious results.

ADD REPLY • link 4 months ago by dthorbur ★ 2.5k

I could get my hands on them, yes; I was just trying to avoid it. The transcriptome used should be identical. I think the difference is just in how they decided to label the annotation data.

ADD REPLY • link 4 months ago by awsk • 0

score 1 · Answer 1 · 2024-06-28

4 months ago

swbarnes2 14k

Convert everything to Ensembl IDs. That will cause the fewest headaches down the road.
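A rough sketch of what converting to Ensembl IDs could look like for two count matrices in base R, assuming the IDs follow the `ENSDARG` plus digits form seen in the question. The helper names (`to_ensembl`, `rename_rows`) and the toy matrices are made up for illustration. Rows that carry only a symbol are dropped here, so they would still need a symbol-to-ID lookup from the annotation if they matter.

```r
id_pattern <- "ENSDARG[0-9]+"

# Hypothetical helper: the Ensembl gene ID in each name, or NA if absent.
to_ensembl <- function(x) {
  out <- rep(NA_character_, length(x))
  hit <- grepl(id_pattern, x)
  out[hit] <- regmatches(x, regexpr(id_pattern, x))
  out
}

# Hypothetical helper: rename rows to bare Ensembl IDs, dropping rows
# without an ID and any repeat of an ID already seen (a lossy choice).
rename_rows <- function(m) {
  ids <- to_ensembl(rownames(m))
  keep <- !is.na(ids) & !duplicated(ids)
  m <- m[keep, , drop = FALSE]
  rownames(m) <- ids[keep]
  m
}

# Toy count matrices using gene names from the question.
m1 <- matrix(1:6, nrow = 3, dimnames = list(
  c("cct8-ENSDARG00000008243", "rps2-ENSDARG00000077291",
    "rps11-ENSDARG00000053058"),
  c("spot1", "spot2")))
m2 <- matrix(1:6, nrow = 3, dimnames = list(
  c("ENSDARG00000077291", "caprin2", "cct8-ENSDARG00000008243"),
  c("cellA", "cellB")))

m1_ids <- rename_rows(m1)
m2_ids <- rename_rows(m2)

# Restrict both matrices to the shared genes, in the same row order.
shared <- intersect(rownames(m1_ids), rownames(m2_ids))
m1_shared <- m1_ids[shared, , drop = FALSE]
m2_shared <- m2_ids[shared, , drop = FALSE]

print(m1_shared)
print(m2_shared)
```

Comparing `nrow(m2)` with `nrow(m2_ids)` shows how many rows were lost for lack of an ID, which is a quick way to judge whether the symbol-only entries can be ignored.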

Or start from scratch. Then you can be confident that everything will work together, and it's easier to document what was done.