I have a pipeline in which one of the BED files contains a bunch of RefSeq IDs. I obtained the original BED file from the UCSC Table Browser under the RefSeq table, so I assumed that all of the RefSeq IDs would be present somewhere in UCSC's MySQL database.
I ran this command:
mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e "select name,name2 from refGene"
However, when I compare this output against my ID list, only IDs of the form 'NR_' or 'NM_' have matches, whereas IDs starting with 'XR_' or 'XM_' don't.
I know that there must be some relation between all of the RefSeq IDs and their gene symbols. For instance, XR_001755761 takes me to this NCBI page, which shows that the corresponding gene symbol is 'LOC101928055'. I've been able to obtain gene symbols for all of the NR_ and NM_ identifiers using the MySQL database, but I don't know how to convert the others.
Is there an easy way to get all of these programmatically? I don't want to use a manual copy-paste service because this needs to run in a pipeline. Currently, I run the SQL command and save the output as a TSV, which I then serialize as a hashmap so I can quickly convert each RefSeq ID to its gene symbol. The end goal is simply to count the number of unique genes in the file. What is the easiest way to get this done? I'm open to using R or some other script, as long as I only have to run it once to generate a pickled Python dict for quick use.
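For reference, the workflow described above (TSV from the SQL query, then a dict, then a count of unique genes) might look roughly like the sketch below. It assumes a header-less, two-column, tab-separated file as produced by the `select name,name2` query with `-N`; the function names and file paths are hypothetical.

```python
import csv
import pickle


def load_refseq_to_symbol(tsv_path):
    """Read a two-column TSV (RefSeq ID, gene symbol) into a dict."""
    mapping = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            mapping[row[0]] = row[1]
    return mapping


def save_mapping(mapping, pickle_path):
    """Pickle the dict so later pipeline runs can load it directly."""
    with open(pickle_path, "wb") as handle:
        pickle.dump(mapping, handle)


def count_unique_genes(refseq_ids, mapping):
    """Return (number of distinct gene symbols, list of IDs with no match)."""
    symbols = set()
    unmapped = []
    for refseq_id in refseq_ids:
        symbol = mapping.get(refseq_id)
        if symbol is None:
            unmapped.append(refseq_id)  # e.g. the XR_/XM_ IDs in question
        else:
            symbols.add(symbol)
    return len(symbols), unmapped
```

The list of unmapped IDs is returned alongside the count so that the XR_ and XM_ identifiers can be passed to whatever second lookup ends up being used.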
Don't waste your time: those accession numbers are not present in the UCSC database.
http://genome.ucsc.edu/cgi-bin/hgTracks?org=Human&db=hg19&position=XR_001755761
http://genome.ucsc.edu/cgi-bin/hgTracks?org=Human&db=hg19&position=LOC101928055
You'd be better off using NCBI's resources via E-utilities (eUtils).
As @Pierre said, you can try the following using the NCBI Unix utilities, looping through your IDs.
This looks great, thank you! If I have several thousand IDs, is there an option to supply them all in a single query so I don't get rate-limited?
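One way to cut down the number of requests is to send the IDs in batches rather than one at a time. The sketch below is only an illustration under assumptions: it splits the ID list into chunks and encodes an E-utilities ELink request (`dbfrom=nuccore`, `db=gene`) for each chunk. The helper names and the batch size are hypothetical, and ELink returns Gene UIDs rather than symbols, so a second lookup (for example ESummary on `db=gene`) would still be needed.

```python
import urllib.parse

# Public E-utilities ELink endpoint
ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_elink_query(accessions):
    """Encode one ELink request body linking nuccore records to Gene.

    Each accession gets its own `id` parameter so the response keeps a
    separate link set per input record instead of merging them.
    """
    params = [("dbfrom", "nuccore"), ("db", "gene")]
    params.extend(("id", acc) for acc in accessions)
    return urllib.parse.urlencode(params)
```

Each encoded body could then be sent as a POST to `ELINK_URL`, for example with `urllib.request.urlopen(ELINK_URL, data=body.encode())`, pausing between batches. Without an API key, NCBI asks for no more than three requests per second. Whether accessions without a version suffix resolve correctly is worth checking on a small batch first.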
I think you should follow the answer provided by UCSC support below in that case.
The refGene table contains only the NM and NR sequences from NCBI, which are then mapped onto the genome with BLAT. If you want all sequences, including XM, XR, etc., you want to query the "ncbiRefSeq" table:
This blog post contains more information about the new ncbiRefSeq tables, which feature transcripts from RefSeq and use the coordinates provided by RefSeq instead of using coordinates determined by BLAT.
If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:
ChrisL from the UCSC Genome Browser
I'm not completely sure; check these examples:
The first 2 will give results but the last 2 won't. If there truly are XR and XM sequences, that would be amazing, but I can't seem to find any.
NCBI did not include predicted transcripts with their latest hg19 annotation, as stated here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/README_addendum.txt
That is why they don't show up in our tables for hg19. I should have remembered this and not advised you to search these tables; my apologies.
genecats.ucsc : So your answer is no longer applicable? Consider moving it to a comment in that case.