I've got 3 RNA-seq datasets I'm trying to integrate, and there are ~1000 genes with different aliases across sets (when I integrate them these genes do not merge appropriately, as the Seurat package assumes these different aliases are in fact different genes).
Is there any easy way to go about homogenizing all of these? I've tried the online converters but that has not worked very well. Logically it seems easy to convert all of these to Ensembl or Entrez IDs and then convert back to gene IDs, but since these are scRNA-seq datasets there are a variety of "gene names" used for alignment/normalization that get lost during the conversion process since they aren't actually genes (these are of course needed for my downstream analysis). If I have a list of the ~1000 aliases, is there an easy way to search for these aliases in my datasets and replace them with a single gene name?
Can you provide a few examples of these aliases? Are they all human genes or some other organism?
All human gene symbols. For example, one gene is SELENOW (dataset #1), SELW (dataset #2), and SEPW1 (dataset #3).
HGNC:10752 is the unification you need here: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/10752
The best resource that I use for this is GeneCards. While online searching (manually) is free, batch/command line access is not. I use it to narrow down on genes if ambiguity remains after filtering through all known sources (NCBI, biomart, HGNC, etc)
I will caution against relying too heavily on GeneCards, as it's essentially a commercial entity and it's not clear how much of their data is pulled. It's super convenient for a spot check, but as Ram mentioned, there are more open sources that are much easier to automate (several of which have been mentioned here) and are either updated in real-time or are the actual data providers.
You can download current HGNC names from their BioMart. Select "previous symbol" and "alias symbol" for output in addition. That should cover the situation illustrated above.
I'm curious if there's an easy way to automate this?
You can download the entire list by using "Download" button once you get the initial screen of results. There are ~42K symbols at the moment.
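Once the file is downloaded, reshaping it into one (approved, other name) pair per row makes the later lookups much simpler. Here is a minimal sketch in plain Python. The column headers ("Approved symbol", "Previous symbols", "Alias symbols"), the comma-separated layout of the multi-value columns, and the toy rows are all assumptions for illustration, so check them against your actual download before relying on this.

```python
import csv
import io

# Toy stand-in for a downloaded tab-separated file. Headers and rows are
# assumptions for illustration; GENE_A, GA1 and GA2 are made-up names.
TOY_TSV = (
    "Approved symbol\tPrevious symbols\tAlias symbols\n"
    "SELENOW\tSEPW1\tSELW\n"
    "GENE_A\t\tGA1, GA2\n"
)

def long_pairs(handle):
    """Turn one-row-per-gene input into (approved, other_name, kind) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = []
    for row in reader:
        approved = row["Approved symbol"].strip()
        for column, kind in (("Previous symbols", "previous"),
                             ("Alias symbols", "alias")):
            # Multi-value cells are assumed to be comma-separated.
            for name in (row[column] or "").split(","):
                name = name.strip()
                if name:
                    pairs.append((approved, name, kind))
    return pairs

pairs = long_pairs(io.StringIO(TOY_TSV))
```

For a real file you would pass an open file handle instead of the `io.StringIO` wrapper.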
Just to clarify: By "gene ID", are you referring to HGNC symbols? What are "gene names"?
Yep, gene symbols. Not sure if they are all the official HGNC ones.
You're going to need to be more specific and treat each symbol differently then - those that are official HGNC symbols, those that are synonyms or previous symbols of official HGNC symbols, those not found in HGNC but found in, say, the NCBI Gene database, etc. It's not an exact process, but it gets you close to addressing all entries. It will involve manual work too.
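The triage described above can be sketched roughly as follows, in plain Python. The table is a toy stand-in for an HGNC download: which of SELW/SEPW1 is a "previous" symbol versus an "alias" is assumed here, and `ERCC-00002` is a hypothetical non-gene feature of the kind the original question mentions.

```python
# Toy rows: (approved symbol, previous symbols, alias symbols).
# The previous/alias split is an assumption for illustration.
HGNC_ROWS = [
    ("SELENOW", ["SEPW1"], ["SELW"]),
]

def build_index(rows):
    """Return the set of approved symbols plus previous/alias lookups."""
    approved, previous, alias = set(), {}, {}
    for symbol, prev_list, alias_list in rows:
        approved.add(symbol)
        for name in prev_list:
            previous.setdefault(name, set()).add(symbol)
        for name in alias_list:
            alias.setdefault(name, set()).add(symbol)
    return approved, previous, alias

def classify(feature, approved, previous, alias):
    """Put a feature name in exactly one bucket, official symbols first."""
    if feature in approved:
        return "approved"
    if feature in previous:
        return "previous"
    if feature in alias:
        return "alias"
    return "not_in_hgnc"  # candidates for NCBI Gene lookup or manual review

approved, previous, alias = build_index(HGNC_ROWS)
features = ["SELENOW", "SELW", "SEPW1", "ERCC-00002"]
categories = {f: classify(f, approved, previous, alias) for f in features}
```

Checking "approved" first matters: a string that is an official symbol for one gene can also be listed as an alias of another, and you usually do not want to rewrite official symbols.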
I would convert everything to Ensembl Gene ID which is the only reliable way to ensure that you have consistency. Is there anything wrong with this approach?
ENSG IDs don't have a 1:1 mapping to HGNC gene symbols. One way to narrow it down is to eliminate those ENSGs that map to HGNC symbols for genes that fall outside the 1-22XY chromosomes (i.e. in the patch/alt contigs).
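As a rough illustration of that filtering step, here is a sketch in plain Python. The gene IDs, scaffold name, and symbols are made up, and the table layout (ID, chromosome/scaffold name, symbol) is an assumption about how you exported the annotation.

```python
from collections import defaultdict

# Primary assembly chromosomes as named in the post above: 1-22, X and Y.
PRIMARY_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Hypothetical rows: (gene ID, chromosome or scaffold name, symbol).
TOY_ROWS = [
    ("ENSG_TOY_0001", "19", "GENE_A"),
    ("ENSG_TOY_0002", "CHR_TOY_PATCH_19", "GENE_A"),  # patch/alt copy
    ("ENSG_TOY_0003", "X", "GENE_B"),
]

def keep_primary(rows):
    """Drop rows whose gene sits on a patch or alternate contig."""
    return [row for row in rows if row[1] in PRIMARY_CHROMOSOMES]

def symbols_with_multiple_ids(rows):
    """Report symbols that still map to more than one gene ID."""
    ids_by_symbol = defaultdict(list)
    for gene_id, _chrom, symbol in rows:
        ids_by_symbol[symbol].append(gene_id)
    return {s: ids for s, ids in ids_by_symbol.items() if len(ids) > 1}

before = symbols_with_multiple_ids(TOY_ROWS)
after = symbols_with_multiple_ids(keep_primary(TOY_ROWS))
```

Anything still listed in `after` on real data is a genuine one-to-many case that the chromosome filter cannot resolve.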
OK, so I downloaded the HGNC dataset (thanks to all who helped me find this) and reorganized my data. Now I have:
List #1: all features (gene names, both aliases and official) from my dataset in one column. List #2: official gene names from HGNC (column 1) and aliases (column 2).
I'm currently trying to find functions in R, but it seems some type of loop could automate the scheme below? (I'm terrible at coding as I'm an experimentalist, so if anyone has ideas I'm happy to receive them.)
For each row in list #1, search the value (which should be a string?) for a match in list #2 column #2 (the aliases). If a match is found, replace the original string in list #1 with the value in list #2 column #1.
This would work right?
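For what it's worth, the scheme above amounts to a dictionary lookup rather than an explicit search loop. Below is a minimal sketch in plain Python (not R). The toy lists, `GENE_A`/`GENE_B`/`XYZ`, and the `ERCC-00002` feature are hypothetical. It differs from the scheme in two ways: it builds a new list instead of overwriting list #1, and it refuses to map an alias that points at more than one official symbol.

```python
def build_alias_map(pairs):
    """pairs: list of (approved, alias). Split aliases into unique/ambiguous."""
    targets = {}
    for approved, alias in pairs:
        targets.setdefault(alias, set()).add(approved)
    unique = {a: next(iter(t)) for a, t in targets.items() if len(t) == 1}
    ambiguous = {a: sorted(t) for a, t in targets.items() if len(t) > 1}
    return unique, ambiguous

def map_features(features, pairs):
    """Return a new list of mapped names plus the ambiguous aliases found."""
    approved_set = {approved for approved, _ in pairs}
    unique, ambiguous = build_alias_map(pairs)
    mapped = []
    for feature in features:
        if feature in approved_set:
            mapped.append(feature)  # already official: leave alone
        else:
            # Unknown or ambiguous names fall through unchanged.
            mapped.append(unique.get(feature, feature))
    return mapped, ambiguous

list_2 = [("SELENOW", "SELW"), ("SELENOW", "SEPW1"),
          ("GENE_A", "XYZ"), ("GENE_B", "XYZ")]
list_1 = ["SELW", "SEPW1", "SELENOW", "XYZ", "ERCC-00002"]
mapped, ambiguous = map_features(list_1, list_2)
```

Non-gene features such as spike-ins pass through untouched because they never appear in the alias column, which addresses the concern about losing them in an ID round trip.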
Try some things out and if you run into problems ask a new focused question providing any code you tried to use.
Don't replace - you may end up with duplicates. This problem needs a much more nuanced, save-state-at-each-step approach than a simple find-and-replace. Tread carefully while mapping: if you replace without a record of what you replaced, what you replaced it with, where the replacement came from, and when it was done, you will get lost and won't be able to map your final RNA-seq dataset back to your initial one.
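One way to keep that kind of record is sketched below in plain Python. The record fields, the `source` label, and the toy mapping are assumptions for illustration, not a prescribed format.

```python
from collections import defaultdict
from datetime import date

def map_with_log(features, alias_to_approved, source):
    """Map names without touching the input; return one record per feature."""
    today = date.today().isoformat()
    log = []
    for original in features:
        mapped = alias_to_approved.get(original, original)
        log.append({
            "original": original,          # what was there
            "mapped": mapped,              # what it became
            "changed": mapped != original,
            "source": source,              # where the mapping came from
            "date": today,                 # when the mapping was done
        })
    return log

def find_collisions(log):
    """Mapped names that more than one original feature now shares."""
    groups = defaultdict(list)
    for record in log:
        groups[record["mapped"]].append(record["original"])
    return {name: origs for name, origs in groups.items() if len(origs) > 1}

# Hypothetical case: one dataset that contains both the old and new symbol.
features = ["SELENOW", "SEPW1", "ACTB"]
alias_to_approved = {"SEPW1": "SELENOW", "SELW": "SELENOW"}
log = map_with_log(features, alias_to_approved, source="HGNC download (toy)")
collisions = find_collisions(log)
```

Collisions are the duplicates mentioned above: two rows in the same dataset that end up with the same name. How to resolve them (sum counts, keep one, keep both under distinct names) is a decision to make explicitly rather than by accident.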
Hi all! Thanks for the help, I think I have got this working. However, some have recommended I find a way to check if the results are accurate (or do something more advanced than a find & replace because this could introduce errors?). Does anyone have some recommendations? I manually checked the results and they seemed accurate (although I obviously can't check 26,000 genes manually). I also started out with a trial dataset of ~100 genes and it was 100% accurate for that. Do people still think something more complicated/additional validation is necessary?
See code below: (dataset is a dataframe of 26k gene names, and database is a dataframe with 2 columns, one of which contains gene aliases and the other of which contains the approved symbol)
I know I'm being a pain in the behind with this - any good method you use will have 90-95% accuracy on the entire 26,000 gene dataset. You'll only run into problems with pseudogenes, AS/DT entries etc. There will come a point when you will have to ignore a few entries, so you may as well go with what works best for you and fix errors as you encounter them later. Gene identifiers are not a perfect thing and you'll never have a solution that works 100%.