Question

RefSeq IDs (NM*, NR*) to Ensembl transcript IDs (ENST*)

8.8 years ago

Avi Srivastava ▴ 130

I have a data set in which each transcript is identified by a RefSeq ID. I am unable to get a unique Ensembl transcript ID for each RefSeq ID.

I've tried BioMart, as suggested in some previous posts, but it didn't give unique mappings to Ensembl, and approximately 6K of the RefSeq IDs cannot be found in the table at all.
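To put numbers on both problems (non-unique mappings and IDs with no mapping), a mapping table can be summarized with a few lines of Python. This is only a sketch: `summarize_mapping` is a hypothetical helper, and it assumes the table has already been read into (RefSeq ID, Ensembl transcript ID) pairs with any version suffixes removed.

```python
from collections import defaultdict

def summarize_mapping(query_ids, pairs):
    """Split query RefSeq IDs into unique, ambiguous and missing.

    query_ids: iterable of RefSeq IDs to look up.
    pairs: iterable of (refseq_id, ensembl_transcript_id) tuples,
           e.g. rows of an exported mapping table (assumed format).
    """
    targets = defaultdict(set)
    for refseq_id, enst_id in pairs:
        targets[refseq_id].add(enst_id)

    unique, ambiguous, missing = {}, {}, []
    for query in query_ids:
        hits = targets.get(query, set())
        if len(hits) == 1:
            unique[query] = next(iter(hits))
        elif hits:
            ambiguous[query] = sorted(hits)
        else:
            missing.append(query)
    return unique, ambiguous, missing
```

The `missing` list is the set of IDs worth chasing through another source, and `ambiguous` shows which RefSeq IDs map to more than one Ensembl transcript.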

I also tried the UCSC public MySQL server. The suggested approach is to use the table mrnaRefSeq, but when I try to fetch the table with the command

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e "select * from mrnaRefSeq" > test1.txt

I get

ERROR 1146 (42S02) at line 1: Table 'hg19.mrnaRefSeq' doesn't exist

Any suggestions?

Data can be found here. The second column of the data contains the relevant RefSeq IDs.

RNA-Seq • 8.2k views

Updated 2.3 years ago by Ram 44k • written 8.8 years ago by Avi Srivastava ▴ 130

This should work with the UCSC table browser:

https://genome.ucsc.edu/cgi-bin/hgTables

Select assembly: hg19, track: Ensembl Genes, output format: "selected fields from primary and related tables", and then press "get output". Under linked tables, pick hg19.knownToEnsembl, hg19.knownToRefSeq and hg19.kgXref (using the "allow selection" button at the bottom in between). Then check whichever columns you want (gene symbol, Ensembl ID, RefSeq ID, etc.) and press "get output" again. This should create a tab-delimited file with the desired information.

Updated 4.9 years ago by Ram 44k • written 8.8 years ago by trausch ★ 1.9k

Ram · Answer 1 · 2016-01-21

Hi,

Databases keep updating, and if you work with large gene sets you will find that a number of IDs fall through the sieve. NCBI RefSeq, for example, is under continual manual curation, and if you check every other month you will see that some IDs have been "suppressed".

I clicked your file link and it looks like Cufflinks output (?). If so, the best approach is to find out which GTF was used and where it came from (Ensembl, UCSC, etc.), ideally down to the version. For example, I have been working with Ensembl release 73 data. Every time I need to annotate something with BioMart, I use the archived version linked to release 73 (by the way, you can do that through the Bioconductor package biomaRt too).

If finding the database version for your GTF is not an option, you can check this FTP link and download the gene2ensembl file, select the rows for the human taxon (9606, I think), and you have a mapping of RefSeq to Ensembl. I can't guarantee that all your RefSeq IDs will get a mapping, but this is the most comprehensive source for NCBI gene annotation.
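As a rough sketch of the gene2ensembl route, the Python below filters the file to one taxon and builds a RefSeq-to-Ensembl transcript map. The column positions are an assumption (tax ID in column 1, RefSeq RNA accession in column 4, Ensembl transcript ID in column 5); check them against the header line of the file you actually download. `refseq_to_ensembl` and `strip_version` are illustrative names, not part of any NCBI tool.

```python
import csv
from collections import defaultdict

def strip_version(accession):
    """Drop a trailing version suffix, e.g. 'NM_111111.2' -> 'NM_111111'."""
    return accession.split(".", 1)[0]

def refseq_to_ensembl(lines, tax_id="9606"):
    """Map RefSeq RNA accessions to sets of Ensembl transcript IDs.

    lines: iterable of tab-separated text lines from gene2ensembl.
    Assumed column order (verify against the file header):
    0 tax_id, 1 GeneID, 2 Ensembl gene, 3 RefSeq RNA accession,
    4 Ensembl transcript, then the protein columns.
    """
    mapping = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines, the '#' header line and short rows.
        if not row or row[0].startswith("#") or len(row) < 5:
            continue
        if row[0] != tax_id:
            continue
        refseq, enst = row[3], row[4]
        # '-' marks a missing value in NCBI gene tables.
        if refseq == "-" or enst == "-":
            continue
        mapping[strip_version(refseq)].add(strip_version(enst))
    return mapping

# Usage on the downloaded file (the local file name is an assumption):
#   import gzip
#   with gzip.open("gene2ensembl.gz", "rt") as handle:
#       mapping = refseq_to_ensembl(handle)
```

Version suffixes are stripped on both sides so that an unversioned query ID still matches. The values are sets because one RefSeq accession can map to more than one Ensembl transcript.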

score 0 · Answer 2 · 2016-01-21

8.8 years ago

Avi Srivastava ▴ 130

Thanks @Amitm. Yes, I know that's the problem: I don't know which version of the GTF was used to create this.

I also tried the file you pointed to, and I am still missing approximately 7K RefSeq IDs.


Hi,

I'm not sure I know of another option. I am guessing that you have already tried gene ID converters, like the one on DAVID. Depending on your biological question, if you think those IDs must be annotated, you can try this:

For the missing ones, you should have the genomic coordinates in the Cufflinks result file. You need to know at least whether they are on hg19, an earlier assembly, or the latest one. Then use those coordinates and a recent gene annotation file for the same assembly (NCBI, Ensembl, whatever suits you) to find out which annotated transcripts they overlap. I think that should solve your issue.
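The coordinate-overlap idea can be sketched in Python as follows. This is a minimal illustration, not a replacement for a dedicated interval tool: it assumes the annotation has already been reduced to (chromosome, start, end, strand, transcript ID) tuples with 1-based inclusive coordinates, as in GTF, and that the query coordinates are on the same assembly. The function names are hypothetical.

```python
from collections import defaultdict

def build_index(annotations):
    """Group annotated transcripts by chromosome, sorted by start.

    annotations: iterable of (chrom, start, end, strand, transcript_id)
    with 1-based inclusive coordinates (assumed, as in GTF).
    """
    index = defaultdict(list)
    for chrom, start, end, strand, transcript_id in annotations:
        index[chrom].append((start, end, strand, transcript_id))
    for chrom in index:
        index[chrom].sort()
    return index

def overlapping_transcripts(index, chrom, start, end, strand=None):
    """Return IDs of annotated transcripts overlapping chrom:start-end.

    If strand is given, only transcripts on that strand are reported.
    """
    hits = []
    for a_start, a_end, a_strand, transcript_id in index.get(chrom, []):
        if a_start > end:
            break  # entries are sorted by start, so none can overlap from here
        if a_end >= start and (strand is None or strand == a_strand):
            hits.append(transcript_id)
    return hits
```

A linear scan per query is fine for a few thousand unmapped transcripts. A query that overlaps several annotated transcripts returns all of them, so a tie-breaking rule (for example, the largest overlap or matching exon structure) is still needed to get a unique ID.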

Written 8.8 years ago by Amitm ★ 2.3k