Question

Easiest way to homogenize gene aliases across datasets?

0

Entering edit mode

4.0 years ago

dlmatera ▴ 10

I've got 3 rna seq datasets Im trying to integrate and there are ~1000genes with different aliases across sets (when I integrate them these genes do not merge appropriately as the Seurat package assumes these different aliases are in fact different genes)

Is there any easy way to go about homogenizing all of these? I've tried the online converters but that has not worked very well? Logically it seems easy to convert all of these to ensembl or entrez IDs and then convert back to gene IDs, but since these are sc-rna-seq datasets there is a variety of "gene names" which are for alignment/normalization that get lost during the conversion process since they aren't actually genes (these are of course needed for my downstream analysis). If I have a list of the ~1000 aliases, is there an easy way to search for these aliases in my datasets and replace with a single gene name?

RNA-Seq R gene alignment • 5.8k views

ADD COMMENT • link 3.9 years ago by dlmatera ▴ 10

0

Entering edit mode

Can you provide a few examples of these aliases? Are they all human genes or some other organism?

ADD REPLY • link 4.0 years ago by vkkodali_ncbi ★ 3.8k

0

Entering edit mode

all human gene symbols, for example one is SELENOW (dataset#1), SELW (dataset#2), and SEPW1 (dataset #3)

ADD REPLY • link 4.0 years ago by dlmatera ▴ 10

0

Entering edit mode

HGNC:10752 is the unification you need here: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/10752

Symbol: SELENOW
Previous Symbol(s): SEPW1
Alias Symbol(s): SELW

The best resource that I use for this is GeneCards. While online searching (manually) is free, batch/command line access is not. I use it to narrow down on genes if ambiguity remains after filtering through all known sources (NCBI, biomart, HGNC, etc)

ADD REPLY • link 4.0 years ago by Ram 44k

1

Entering edit mode

I will caution against relying too heavily on GeneCards, as it's essentially a commercial entity and it's not clear how much of their data is pulled. It's super convenient for a spot check, but as Ram mentioned, there are more open sources that are much easier to automate (several of which have been mentioned here) and are either updated in real-time or are the actual data providers.

ADD REPLY • link 4.0 years ago by jared.andrews07 ★ 18k

0

Entering edit mode

You can download current HGNC names from their BioMart. Select previous symbol and alias symbol for output in addition. That should cover situation Illustrated above.

ADD REPLY • link 4.0 years ago by GenoMax 147k

0

Entering edit mode

im curious if there's an easy way to automate this?

ADD REPLY • link 4.0 years ago by dlmatera ▴ 10

1

Entering edit mode

You can download the entire list by using "Download" button once you get the initial screen of results. There are ~42K symbols at the moment.

ADD REPLY • link 4.0 years ago by GenoMax 147k

0

Entering edit mode

Just to clarify: By "gene ID", are you referring to HGNC symbols? What are "gene names"?

ADD REPLY • link 4.0 years ago by Ram 44k

0

Entering edit mode

yep gene symbols, not sure if they are all the official HGNC one

ADD REPLY • link 4.0 years ago by dlmatera ▴ 10

0

Entering edit mode

You're going to need to be more specific and treat each symbol differently then - those that are official HGNC symbols, those that are synonyms or previous symbols of official HGNC symbols, those not found in HGNC but found in, say, the NCBI Gene database, etc. It's not an exact process, but it gets you close to addressing all entries. It will involve manual work too.

ADD REPLY • link 4.0 years ago by Ram 44k

0

Entering edit mode

I would convert everything to Ensembl Gene ID which is the only reliable way to ensure that you have consistency. Is there anything wrong with this approach?

ADD REPLY • link 4.0 years ago by ATpoint 85k

0

Entering edit mode

ENSG IDs don't have a 1:1 mapping to HGNC gene symbols. One way to narrow it down is to eliminate those ENSGs that map to HGNC symbols for genes that fall outside the 1-22XY chromosomes (i.e. in the patch/alt contigs).

ADD REPLY • link 4.0 years ago by Ram 44k

0

Entering edit mode

Ok so I downloaded the HGNC dataset (thanks to all who helped me find this) and reorganized my data, now I have

List #1- All features (gene names..aliases and official) from my dataset in one column. List #2 -official gene names from HGNC (column 1) an aliase (column 2)

Im currently trying to find functions in R, but it seems some type of loop could automate the scheme below? (Im terrible at coding as im an experimentalist so if anyone has ideas im happy to receive)

-for each row in list #1, search the value (which should be a string?) for a match in list #2 column #2 (the aliases), if a match is found, replace the original string in List#1 with the value in list #2 column#1

This would work right?

ADD REPLY • link 4.0 years ago by dlmatera ▴ 10

1

Entering edit mode

Try some things out and if you run into problems ask a new focused question providing any code you tried to use.

ADD REPLY • link 4.0 years ago by GenoMax 147k

0

Entering edit mode

Don't replace - you may end up with duplicates. This problem needs a much more nuanced and save-state-at-each-step approach than a simple Find-and-Replace. Tread carefully while mapping - replacing without a record of what you replaced and what you replaced it with, along with where you got these from as well as when the replacement was done will get you lost where you won't be able to map your final RNAseq dataset to your initial RNAseq dataset.

ADD REPLY • link 4.0 years ago by Ram 44k

0

Entering edit mode

Hi all! Thanks for the help I think I have got this working. However, some have recommended I find a way to check if the results or accurate (or do something more advanced than a find& replace because this could introduce errors?) Does anyone have some recommendations? I manually checked the results & it seemed accurate (although I obviously cant check 26,000 genes manually). I also started out with a trial dataset of ~100 genes and it was 100% accurate for that. Do people still think something more complicated/additional validation is necessary?

See code below: (dataset is a dataframe of 26k gene names & database is a dataframe w/ 2 columns one of which contains gene aliases and the other which contains the approved symbol

indx<-match(dataset$gene, databse$Alias, nomatch = 0)  #create and index of all locations in the dataset where the gene name matches that of a known gene alias 

dataset$gene[indx !=0] <- database$approvedsymbol[indx]
 #for locations where indx=/0 (when a gene alias is found), replace the gene name in the dataset with its approved symbol from the databse

ADD REPLY • link updated 3.9 years ago by GenoMax 147k • written 3.9 years ago by dlmatera ▴ 10

1

Entering edit mode

I know I'm being a pain in the behind with this - any good method you use will have 90-95% accuracy on the entire 26,000 gene dataset. You'll only run into problems with pseudogenes, AS/DT entries etc. There will come a point when you will have to ignore a few entries, so you may as well go with what works best for you and fix errors as you encounter them later. Gene identifiers are not a perfect thing and you'll never have a solution that works 100%.

ADD REPLY • link 3.9 years ago by Ram 44k

score 3 · Answer 1 · 2020-12-09

The best tool I've found for dealing with gene symbol aliases is the MyGene API. It has associated R and python packages as well.

It's relatively straightforward to use, but the general process for this would go something like:

1.) Query via gene symbol and grab the top hit:

http://mygene.info/v3/query?q=symbol:cdk2&species=human

{"took": 4, 
"total": 1, 
"max_score": 88.06128, 
"hits": [{"_id": "1017", "_score": 88.06128, "entrezgene": "1017", "name": "cyclin dependent kinase 2", "symbol": "CDK2", "taxid": 9606}]}

2.) Pull the aliases/official symbol for the Entrez ID of the top hit:

https://mygene.info/v3/gene/1017?fields=alias,symbol

{"_id": "1017", "_version": 9, "alias": ["CDKN2", "p33(CDK2)"], "symbol": "CDK2"}

3.) Look for alias symbols in your other samples and change them to the official symbol. If necessary, change your query symbol to the official one too (as it could be an alias). I leave the implementation of this to you. I highly recommend you retain the original symbols somewhere rather than replacing them outright. In reality, I'd use stable gene identifiers as much as possible (ENSG preferably). As others have mentioned, this isn't a simple problem and has many edge cases.

I expect this will deal with most genes relatively well. The lncRNAs/pseudogenes/etc are a crap shoot.

h.mon · Answer 2 · 2020-12-09

You can use NCBI Datasets for this. Specifically, you can use the command line tool as follows:

for s in SELENOW SELW SEPW1 ; do 
echo -ne "$s\t" ; 
datasets summary gene symbol $s | jq -r '.genes[] | [.gene.gene_id,.gene.symbol] | @tsv ' ; 
done 

SELENOW 6415    SELENOW
SELW    6415    SELENOW
SEPW1   6415    SELENOW

The output of datasets command here is in json format that I am parsing using the tool jq. The final tab-delimited output has the input symbol, NCBI GeneID and the HGNC symbol.

score 2 · Answer 3 · 2020-12-09

With EntrezDirect (additional parsing needed) :

$ esearch -db gene -query "human [orgn]" | efetch | grep -A 1 "Official" | head -11

Official Symbol: TP53 and Name: tumor protein p53 [Homo sapiens (human)]
Other Aliases: BCC7, BMFS5, LFS1, P53, TRP53
--
Official Symbol: EGFR and Name: epidermal growth factor receptor [Homo sapiens (human)]
Other Aliases: ERBB, ERBB1, HER1, NISBD2, PIG61, mENA
--
Official Symbol: TNF and Name: tumor necrosis factor [Homo sapiens (human)]
Other Aliases: DIF-alpha, TNFA, TNFSF2, TNLG1F, TNF
--
Official Symbol: APOE and Name: apolipoprotein E [Homo sapiens (human)]
Other Aliases: AD2, APO-E, ApoE4, LDLCQ5, LPG