I would like to have the biotypes (as in Ensembl transcripts), but for the NCBI RefSeq annotation downloaded from the UCSC MySQL server. Which table should I join, and which column should I use?
My current query without the biotype is:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -D hg19 -N -A -e 'select bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,name2 from refGene'
Technically the biotype is for a transcript, and not a gene. While in many cases the biotype of all transcripts for a gene will be the same, you get a few that aren't. That said I'm not sure off the top of my head if you can do a join within UCSC between the two tables. You might need to output two datasets (refseq and ensembl) and cross-correlate them yourself with a little script. You can output alternative IDs in both tables, and use whichever you prefer to link the two tables.
Update: This is a roughly step-by-step guide to how I generated a table linking Ensembl transcript IDs to RefSeq IDs in the UCSC Table Browser:
Group: Genes and Gene Predictions; Track: Ensembl Genes; Table: ensGene
Output format: selected fields from primary and related tables
Click on get output
On the next page, under linked tables, tick the box beside ccdsInfo and then click allow selection from checked tables
More tables come up for linking. Click on ccds id under the CCDS table info fields, then also tick the table knownToRefSeq and click allow selection from checked tables again
Under knownToRefSeq, tick both fields (primary id and value)
Click get output near the top of the page under the fields for the ensembl table
You'll end up with Ensembl transcript IDs in the first column and a list of NM IDs in the final column. You can then process this however you like to get a mapping of RefSeq IDs to Ensembl. I didn't poke around enough to see whether there is a linked table that provides biotypes; those may be easier to get from BioMart on the Ensembl website itself. With those two files you should be able to parse them as tab-delimited data and create a mapping file, associate biotypes, etc. with a fairly simple script in Perl, Python, or your scripting language of choice.
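The cross-referencing described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the Table Browser output is assumed to have the Ensembl transcript ID in the first column and a comma-separated RefSeq list (possibly with a trailing comma, or n/a when empty) in the last column, and the BioMart export is assumed to be a two-column tab-delimited file of transcript ID and biotype. The function names and the IDs in the demo are made up for the example.

```python
import csv
import io
from collections import defaultdict


def load_ens_to_refseq(handle):
    """Map Ensembl transcript ID -> set of RefSeq IDs.

    Assumed layout: transcript ID in the first column, comma-separated
    RefSeq accessions in the last column, header lines starting with '#'.
    """
    mapping = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        for acc in row[-1].split(","):
            acc = acc.strip()
            if acc and acc != "n/a":  # assumed placeholder for "no link"
                mapping[row[0]].add(acc)
    return mapping


def load_biotypes(handle):
    """Map Ensembl transcript ID -> biotype (assumed two-column export)."""
    biotypes = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Keeping only rows that start with ENST also skips the header line.
        if len(row) >= 2 and row[0].startswith("ENST"):
            biotypes[row[0]] = row[1]
    return biotypes


def refseq_biotypes(ens_to_refseq, biotypes):
    """Map RefSeq ID -> set of biotypes of the linked Ensembl transcripts."""
    result = defaultdict(set)
    for ens_id, accessions in ens_to_refseq.items():
        biotype = biotypes.get(ens_id)
        if biotype is None:
            continue  # transcript missing from the biotype export
        for acc in accessions:
            result[acc].add(biotype)
    return result


if __name__ == "__main__":
    # Made-up rows standing in for the two downloaded files.
    ucsc = io.StringIO("#ensGene.name\tknownToRefSeq.value\nENST_A\tNM_A,NM_B,\n")
    mart = io.StringIO("Transcript ID\tBiotype\nENST_A\tprotein_coding\n")
    table = refseq_biotypes(load_ens_to_refseq(ucsc), load_biotypes(mart))
    for acc in sorted(table):
        print(acc, ",".join(sorted(table[acc])), sep="\t")
```

A set of biotypes is kept per RefSeq ID because one NM ID can be linked to several Ensembl transcripts, and those transcripts need not all share the same biotype.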
written 9.4 years ago by DG, updated 2.1 years ago by Ram
"protein coding genes" was a mistype; I meant to say 'transcript'. I wanted to double-check how to filter the list of NM_ IDs for a gene down to the ones that are real, 'functional' transcripts. I would also like to query in which tissue, if any, each one is the primary transcript. Expression Atlas would probably be the place to go for this last question.
NM: known mRNA
XM: predicted mRNA
NR: known ncRNA
XR: predicted ncRNA
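As a small illustration of using these prefixes to filter a list of RefSeq accessions, here is a minimal Python sketch. The lookup table simply restates the four prefixes listed above; the function name `refseq_category` and the sample accessions are made up for the example.

```python
# Prefix meanings as listed above.
REFSEQ_PREFIXES = {
    "NM": "known mRNA",
    "XM": "predicted mRNA",
    "NR": "known ncRNA",
    "XR": "predicted ncRNA",
}


def refseq_category(accession):
    """Return the category for a RefSeq transcript accession, or None."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    # Illustrative accessions only.
    accessions = ["NM_000001", "XM_000002", "NR_000003", "XR_000004"]
    known_mrna = [a for a in accessions if refseq_category(a) == "known mRNA"]
    print(known_mrna)
```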
written 9.4 years ago by Emily, updated 2.1 years ago by Ram
Thanks Emily. It took me some time to find this out after posting the question. I had never come across an XR, so thanks for that one. I wish this nomenclature were easier to find when you google for RefSeq transcript nomenclature. If you go to http://www.ncbi.nlm.nih.gov/refseq/about/ you can read only about the X and the N:
Definitions:
Model RefSeq: RNA and protein products that are generated by the eukaryotic genome annotation pipeline. These records use accession prefixes XM_, XR_, and XP_.
Known RefSeq: RNA and protein products that are mainly derived from GenBank cDNA and EST data and are supported by the RefSeq eukaryotic curation group. These records use accession prefixes NM_, NR_, and NP_.
I think the word 'predicted' would be a bit more meaningful than 'generated' when you skim 'generated by the eukaryotic genome annotation pipeline'. 'Generated' is fine, but its real meaning is not easy to grasp when you are in a hurry :-(
Does it have to be via UCSC? If you use Ensembl, you could definitely get the biotypes.
For some applications I use RefSeq and HGMD, both of which use NM IDs. Has Ensembl already solved the problems with mapping/integrating RefSeq IDs?
Using either biomart or UCSC you should be able to generate a table with the mappings between IDs. I have updated my answer below to outline how I did this with the UCSC Table Browser
Yes, you could get mapping between different IDs from UCSC. However, as far as I know, you cannot get the biotypes from UCSC. That was where I was going with my initial question, which probably wasn't very clear.
Of course, I understood that. I was just pointing out that really you need to use both most likely. You may be able to do something similar to what I posted below and get the RefSeq IDs in a table through Biomart directly all in one step.