get biotype for refseq NM transcripts
9.4 years ago

I would like to have the biotypes (like in ensembl transcripts) but for the NCBI refSeq downloaded from UCSC mysql. Which table should I join and which column should I use?

My current query without the biotype is:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -D hg19 -N -A -e 'select bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,name2 from refGene'

Does it have to be via UCSC? If you use Ensembl, you could definitely get the biotypes.


For some applications I use RefSeq and HGMD, both of which use NM accessions. Has Ensembl already solved the problems of mapping/integrating RefSeq IDs?


Using either BioMart or UCSC you should be able to generate a table with the mappings between IDs. I have updated my answer below to outline how I did this with the UCSC Table Browser.


Yes, you could get mapping between different IDs from UCSC. However, as far as I know, you cannot get the biotypes from UCSC. That was where I was going with my initial question, which probably wasn't very clear.


Of course, I understood that. I was just pointing out that you will most likely need to use both. You may be able to do something similar to what I posted below and get the RefSeq IDs in a table through BioMart directly, all in one step.

9.4 years ago
DG 7.3k

Technically, the biotype applies to a transcript, not a gene. While in many cases all transcripts of a gene share the same biotype, a few do not. That said, I'm not sure off the top of my head whether you can do a join within UCSC between the two tables. You might need to output two datasets (RefSeq and Ensembl) and cross-correlate them yourself with a little script. You can output alternative IDs in both tables and use whichever you prefer to link the two.

Update: Here is a step-by-step guide to how I generated a table linking Ensembl transcript IDs to RefSeq IDs:

UCSC Table Browser: https://genome.ucsc.edu/cgi-bin/hgTables

Group: Genes and Gene Predictions; Track: Ensembl Genes; Table: ensGene

Output format: selected fields from primary and related tables

  1. Click "get output".
  2. On the next page, under linked tables, tick the box beside ccdsInfo and then click "allow selection from checked tables".
  3. More tables come up for linking. Tick the CCDS ID field under the ccdsInfo fields, tick the knownToRefSeq table, and click "allow selection from checked tables" again.
  4. Under knownToRefSeq, tick both fields (primary ID and value).
  5. Click "get output" near the top of the page, under the fields for the Ensembl table.

You'll end up with Ensembl transcript IDs in the first column and a list of NM IDs in the final column. You can then process this however you like to get a mapping of RefSeq IDs to Ensembl. I didn't poke around enough to see whether a linked table provides biotypes; those may be easier to get from BioMart on the Ensembl website itself. With those two files you should be able to parse them as tab-delimited data and create a mapping file, associate biotypes, etc., with a fairly simple script in Perl, Python, or your scripting language of choice.
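
For illustration, the cross-referencing step could look roughly like the Python sketch below. It is only a sketch under assumptions: the Table Browser export is taken to have the Ensembl transcript ID in the first column and comma-separated RefSeq IDs in the last, with "n/a" for missing values, and the BioMart export is taken to be a two-column file (transcript ID, biotype). The header names, the helper function names, and all IDs in the demo are made up, so adjust them to whatever your actual files contain.

```python
import csv
import io
from collections import defaultdict


def read_refseq_links(handle):
    """Map Ensembl transcript ID -> set of RefSeq IDs.

    Assumed layout: first column = Ensembl transcript ID,
    last column = comma-separated RefSeq IDs (or "n/a").
    """
    links = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank/short lines and the '#' header line
        for refseq_id in row[-1].split(","):
            refseq_id = refseq_id.strip()
            if refseq_id and refseq_id != "n/a":
                links[row[0]].add(refseq_id)
    return links


def read_biotypes(handle):
    """Map Ensembl transcript ID -> biotype (assumed two-column TSV)."""
    biotypes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and not row[0].startswith("#"):
            biotypes[row[0]] = row[1]
    return biotypes


def refseq_to_biotypes(links, biotypes):
    """Map RefSeq ID -> sorted list of biotypes of linked Ensembl transcripts."""
    result = defaultdict(set)
    for enst, refseq_ids in links.items():
        biotype = biotypes.get(enst)
        if biotype is None:
            continue  # transcript missing from the biotype file
        for refseq_id in refseq_ids:
            result[refseq_id].add(biotype)
    return {refseq_id: sorted(types) for refseq_id, types in result.items()}


# Tiny in-memory stand-ins for the two exports (all IDs are invented).
ucsc_tsv = (
    "#name\tccds\tvalue\n"
    "ENST_A\tCCDS1\tNM_000001,NM_000002\n"
    "ENST_B\tn/a\tNR_000003\n"
    "ENST_C\tn/a\tn/a\n"
)
biomart_tsv = (
    "ENST_A\tprotein_coding\n"
    "ENST_B\tlincRNA\n"
    "ENST_C\tprocessed_transcript\n"
)

links = read_refseq_links(io.StringIO(ucsc_tsv))
biotypes = read_biotypes(io.StringIO(biomart_tsv))
mapping = refseq_to_biotypes(links, biotypes)
for refseq_id in sorted(mapping):
    print(refseq_id, ",".join(mapping[refseq_id]), sep="\t")
```

With real files you would pass open file handles instead of the `io.StringIO` objects. A RefSeq ID can end up with more than one biotype if it links to several Ensembl transcripts, which is why the values are lists.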

"Protein coding genes" was a typo; I meant to say "transcripts". I wanted to double-check how to filter the NM_ list for a gene down to the transcripts that are real and "functional". I would also like to query in which tissues, if any, each one is the primary transcript. Expression Atlas would probably be the place to go for that last question.

9.4 years ago
poisonAlien ★ 3.2k

The simplest way is to look at the RefSeq IDs.

Protein-coding transcripts start with 'NM' and non-coding ones start with 'NR'.

There is a whole bunch of nomenclature.


That would only give you two biotypes. Ensembl/GENCODE provides a much richer classification: http://www.gencodegenes.org/gencode_biotypes.html


Do NM transcripts have biotype labels as diverse as the ones Ensembl has?


NM* is just an NCBI ID and there is no biotype associated with it as far as I know.


I saw on Wikipedia that NR is for non-coding, but I read somewhere else that NR was for predicted transcripts, so I was not sure.


N is known, X is predicted.

NM: known mRNA
XM: predicted mRNA
NR: known ncRNA
XR: predicted ncRNA
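
If you want to apply this to a list of accessions, a minimal Python sketch could look like the following. The function and dictionary names are made up for illustration, it only covers the four transcript prefixes listed above, and the accessions in the demo are just example strings.

```python
# Prefix table taken from the list above: N = known, X = predicted.
REFSEQ_PREFIXES = {
    "NM": ("known", "mRNA"),
    "XM": ("predicted", "mRNA"),
    "NR": ("known", "ncRNA"),
    "XR": ("predicted", "ncRNA"),
}


def classify_refseq(accession):
    """Return (status, molecule) for a RefSeq transcript accession.

    Returns None for prefixes not in the table (e.g. protein accessions).
    """
    prefix = accession.split("_", 1)[0].upper()
    return REFSEQ_PREFIXES.get(prefix)


for acc in ["NM_000546.5", "XR_000001.1", "NP_000537.3"]:
    print(acc, classify_refseq(acc))
```

This only gives the coarse coding/non-coding and known/predicted split, not anything like the full set of Ensembl/GENCODE biotypes.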

Thanks Emily. It took me some time to find this out after posting the question. I had never come across XR, so thanks for that one. I wish this nomenclature were easier to find when you google for RefSeq transcript nomenclature. If you go to http://www.ncbi.nlm.nih.gov/refseq/about/ you can read only about the X and the N:

Definitions:

  • Model RefSeq: RNA and protein products that are generated by the eukaryotic genome annotation pipeline. These records use accession prefixes XM_, XR_, and XP_.
  • Known RefSeq: RNA and protein products that are mainly derived from GenBank cDNA and EST data and are supported by the RefSeq eukaryotic curation group. These records use accession prefixes NM_, NR_, and NP_.

I think the word 'predicted' is a bit more meaningful than 'generated' when you are skimming a phrase like 'generated by the eukaryotic genome annotation pipeline'. 'Generated' is fine, but its real meaning is not easy to grasp when you are in a hurry :-(


NM = mRNA and NR = RNA, according to the NCBI tutorials.

When you think about it twice, it makes sense to read NR as "not mRNA":

http://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly
