UCSC MySQL query to find gene symbols for all RefSeq IDs?
1
0
Entering edit mode
6.3 years ago
edsouza ▴ 30

I have a pipeline in which one of the BED files contains a bunch of RefSeq ID's. I obtained the original BED file from the UCSC Table Browser under the RefSeq table, so I assume that all of the RefSeq ID's will be located somewhere in UCSC's mysql database.

I ran this command:

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e "select name,name2 from refGene"

However, when I compare the results of this output against what I have in my ID list, it appears that only ID's in the form of 'NR_' or 'NM_' have matches, whereas the ID's with 'XR_' or 'XM_' don't.

I know that there must be some relation between all of the RefSeq ID's and their gene symbols. For instance, XR_001755761 takes me to this NCBI page, which shows me that the corresponding gene symbol is 'LOC101928055'. I've been able to obtain gene symbols for all of the NR_ or NM_ identifiers using the mysql database, but don't know how to convert all the others.

Is there an easy way that I can programatically get all of these? I don't want to use a manual copy-paste service because this needs to be run in a pipeline. Currently, I run the SQL command and save it as a TSV, which I then serialize as a hashmap to let me quickly convert all the different RefSeq ID's to their respective gene symbol. The end goal is to simply count the number of unique genes in the file. What is the easiest way to get this done? I'm open to using R or some other script, as long as I only have to run it once so I can generate a python pickled dict for quick use.

ucsc genome gene assembly • 4.2k views
ADD COMMENT
1
Entering edit mode

. For instance, XR_001755761 take

don't spoil your time, those accession numbers are not present in the UCSC database.

http://genome.ucsc.edu/cgi-bin/hgTracks?org=Human&db=hg19&position=XR_001755761

http://genome.ucsc.edu/cgi-bin/hgTracks?org=Human&db=hg19&position=LOC101928055

you'd better use the resources from NCBI using eUtils.

ADD REPLY
0
Entering edit mode

As @Pierre said you can try the following using NCBI unix utils. Loop through your ID's.

$ efetch -db nuccore -id "XR_001755761" -format xml | xtract -pattern Gene-ref -element Gene-ref_locus
LOC101928055
ADD REPLY
0
Entering edit mode

This looks great, thank you! If I have several thousand ID's, is there an option to supply them at once in a single query so I don't get rate-limited?

ADD REPLY
0
Entering edit mode

I think you should follow the answer provided by UCSC support below in that case.

ADD REPLY
0
Entering edit mode

The refGene table only contains NM and NR sequences from NCBI, which are then mapped onto the genome with BLAT. If you want all sequences, including the XM, XR, etc, you want to query the "ncbiRefSeq" table:

mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq" hg19

This blog post contains more information about the new ncbiRefSeq tables, which feature transcripts from RefSeq and use the coordinates provided by RefSeq instead of using coordinates determined by BLAT.

If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:

General questions: genome@soe.ucsc.edu
Questions involving private data: genome-www@soe.ucsc.edu
Questions involving mirror sites: genome-mirror@ose.ucsc.edu

ChrisL from the UCSC Genome Browser

ADD REPLY
1
Entering edit mode

I'm not completely sure; check these examples:

mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq where name like 'NR%' limit 1" hg19
mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq where name like 'NM%' limit 1" hg19
mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq where name like 'XR%' limit 1" hg19
mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq where name like 'XM%' limit 1" hg19

The first 2 will give results but the last 2 won't. If there truly are XR and XM sequences, that would be amazing, but I can't seem to find any.

ADD REPLY
1
Entering edit mode

NCBI did not include predicted transcripts with their latest hg19 annotation, as stated here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/README_addendum.txt

Which is why they don't show up in our tables for hg19. I should have remembered this and not advised you to search these tables, my apologies.

ADD REPLY
0
Entering edit mode

genecats.ucsc : So your answer is no longer applicable? Consider moving it to a comment in that case.

ADD REPLY
2
Entering edit mode
6.3 years ago

To do something programmatic, you could use mygene.info with curl to request (query) and jq to parse the JSON result.

Given a text file with each gene name on its own line called my_genes_of_interest.txt, you could pass it to this script, e.g. called map_ids.sh:

#!/bin/bash -e

inFn=$1
apiURL="http://mygene.info/v3/query"

while IFS='' read -r line || [[ -n "$line" ]]; do
    geneId=`echo $line | tr -d $'\n'`
    queryResult=`curl -s "${apiURL}?q=${geneId}" | jq -r '.hits | .[0].symbol' | tr -d $'\n'`
    echo "${geneId}\t${queryResult}"
done < "${inFn}"

Then:

$ ./map_ids.sh my_genes_of_interest.txt > my_goi_mapped_to_symbols.txt

I didn't think to ask how many IDs you are querying, but sending a bunch of single requests may get your IP blocked. As your comment notes, there are Python and R libraries which offer the ability to submit batch queries to the mygene API.

ADD COMMENT
0
Entering edit mode

I'm trying this but I get rate-limited after a few hundred. Is there an easy way to supply a whole list of IDs for conversion at once?

ADD REPLY
2
Entering edit mode

Replying for posterity: mygene has a python API that lets you batch-submit queries. If your list is over 1000, it'll chunk your list into groups of 1000, perform the searches by group, and then return you a final result. The end-user doesn't have to worry about scaling, as the output format is exactly the same even if you'd only queried a single term. http://docs.mygene.info/projects/mygene-py/en/latest/

Here's the gist of my python code:

#!/usr/bin/env python3
import mygene

idlist = set()
with open('hg38.bed', 'r') as f:
        for line in f:
                id = line.strip().split('\t')[3]
                idlist.add(id)

mg = mygene.MyGeneInfo()
resdf = mg.querymany(list(idlist),
        scopes='refseq',
        fields='symbol',
        species='human',
        as_dataframe=True)

print(resdf) # Do stuff with the dataframe, like serialize it to a file.
ADD REPLY
0
Entering edit mode

Nice work — mygene is useful for mapping lots of kinds of IDs, as well, so this is generically useful.

ADD REPLY
0
Entering edit mode

Did this work for hg19 as well?

ADD REPLY
0
Entering edit mode

Do gene symbol mappings change between assemblies? I would expect that gene locations do, and that would definitely be something to consider when using mygene, if locations are retrieved (cf. http://myvariant.info/blog/mygene-info-moved-to-grch38hg38-for-human-genes-with-hg19-still-supported).

ADD REPLY
0
Entering edit mode

If you have RefSeq IDs to map to something else, I would use the NCBI efetch tools, as Pierre Lindenbaum suggests above. I would never use a third party webservice for a pipeline you want to run again.

ADD REPLY

Login before adding your answer.

Traffic: 2537 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6