Convert between RefSeq and Ensembl Transcript?
4
5
10.4 years ago
pwg46 ▴ 540

Hello, I am looking for a good data file for converting between RefSeq accession identifiers (NM_...) and Ensembl transcript IDs (ENST...). Ideally the file could be downloaded with a simple script over FTP, and would be a plain-text file in a reasonably regular format, since I would essentially be parsing out the ENST and NM_ identifiers and inserting them into a MySQL table.

refseq ensembl • 37k views
0

Are you looking to store ENS and NM ids for the same sequence? Or do you wanna store GenBank and ENSEMBL entries?

0

Hmm, I guess the latter. I've been looking through the GenBank data files, which come as one large file per chromosome. These files do have ENST -> NM_ mappings for every transcript on each chromosome, but using them does not seem efficient: they are large and take a fairly long time to download, and parsing them would take quite a while, even though all I want is a tab-delimited text file with the ENST ID in the first column and its corresponding NM_ ID in the second.

0

I'd suggest tinkering with UCSC Genome Browser's mysql database. You should be able to write a query/script that, given ID1, does a bunch of SELECTs for ID2.

11
10.4 years ago
Bert Overduin ★ 3.7k
mysql -u anonymous -h ensembldb.ensembl.org

mysql> use homo_sapiens_core_75_37;

mysql> SELECT transcript.stable_id, xref.display_label
    ->   FROM transcript, object_xref, xref, external_db
    ->  WHERE transcript.transcript_id = object_xref.ensembl_id
    ->    AND object_xref.ensembl_object_type = 'Transcript'
    ->    AND object_xref.xref_id = xref.xref_id
    ->    AND xref.external_db_id = external_db.external_db_id
    ->    AND external_db.db_name = 'RefSeq_mRNA';

+-----------------+----------------+
| stable_id       | display_label  |
+-----------------+----------------+
| ENST00000517143 | NR_046932.1    |
| ENST00000362897 | NR_046944.1    |
| ENST00000384568 | NR_046948.1    |
| ENST00000384769 | NR_002574.1    |
| ENST00000384323 | NR_002575.1    |
| ENST00000516690 | NR_046934.1    |
| ENST00000363970 | NR_046947.1    |
| ENST00000365367 | NR_046928.1    |
| ENST00000517119 | NR_046935.1    |
| ENST00000427390 | NM_001145004.1 |
| ENST00000384474 | NR_046940.1    |
| ENST00000454856 | NM_001277303.1 |
| ENST00000516986 | NR_046930.1    |
| ENST00000559471 | NM_001193489.1 |
| ENST00000261847 | NM_014701.3    |
| ENST00000439682 | NM_001277304.1 |
| ENST00000439682 | NM_207355.2    |
| ENST00000411348 | NR_046931.1    |
| ENST00000298232 | NM_199259.2    |
| ENST00000361285 | NM_199261.2    |
| ENST00000342420 | NM_199260.2    |
| ENST00000569541 | NM_031421.2    |
| ENST00000299443 | NM_174981.3    |
| ENST00000399848 | NM_181482.4    |
| ENST00000359446 | NM_181481.4    |
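
If you capture the boxed output above in a file, a short script can turn it into the two-column, tab-delimited file the question asks for. The following is only a sketch: the function names are made up for illustration, and it assumes every data row has exactly two cells and starts with an ENST identifier. Dropping the version suffix (the ".3" in NM_014701.3) is optional and shown because some tools expect unversioned accessions.

```python
import csv
import io


def parse_mysql_table(text):
    """Yield (ensembl_id, refseq_id) pairs from boxed mysql client output."""
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Keep only two-column data rows; this also skips the header row.
        if len(cells) == 2 and cells[0].startswith("ENST"):
            yield cells[0], cells[1]


def strip_version(accession):
    """Return the accession without its version suffix, e.g. NM_014701.3 -> NM_014701."""
    return accession.split(".", 1)[0]


def to_tsv(pairs, drop_versions=False):
    """Format the pairs as tab-delimited text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for ensembl_id, refseq_id in pairs:
        if drop_versions:
            refseq_id = strip_version(refseq_id)
        writer.writerow([ensembl_id, refseq_id])
    return buffer.getvalue()


# Two rows copied from the output above, used as a small sample.
SAMPLE = """\
+-----------------+----------------+
| stable_id       | display_label  |
+-----------------+----------------+
| ENST00000261847 | NM_014701.3    |
| ENST00000439682 | NM_207355.2    |
"""

if __name__ == "__main__":
    print(to_tsv(parse_mysql_table(SAMPLE)), end="")
```

Alternatively, running the mysql client with its --batch option makes it print tab-separated rows directly, which avoids parsing the box drawing at all.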
3

Or if you want GenBank (EMBL) accession numbers instead:

mysql> SELECT transcript.stable_id, xref.display_label 
>       FROM translation, transcript, object_xref, xref,external_db 
>       WHERE transcript.transcript_id = translation.transcript_id 
>       AND translation.translation_id = object_xref.ensembl_id 
>       AND object_xref.ensembl_object_type = 'Translation' 
>       AND object_xref.xref_id = xref.xref_id 
>       AND xref.external_db_id = external_db.external_db_id 
>       AND external_db.db_name = 'EMBL';
8
10.4 years ago

You can also query Ensembl BioMart programmatically with the R package biomaRt:

library("biomaRt")
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

values <- c("NM_001101", "NM_001256799", "NM_000594")

getBM(attributes = c("refseq_mrna", "ensembl_gene_id", "hgnc_symbol"), filters = "refseq_mrna", values = values, mart = ensembl)

Results:

   refseq_mrna ensembl_gene_id hgnc_symbol
1    NM_000594 ENSG00000232810         TNF
2    NM_001101 ENSG00000075624        ACTB
3 NM_001256799 ENSG00000111640       GAPDH

The output of getBM() can be written to a file in tabular format using write.table().

To see which datasets are available in BioMart and which attributes and filters they have:

listDatasets(useMart("ensembl"))
listFilters(ensembl)
listAttributes(ensembl)

I would also look into the annotation databases maintained in Bioconductor, such as org.Hs.eg.db.

4
9.0 years ago
ashbigdeli ▴ 40

I found this in another post, but I'll echo it here. For all ID conversions I have found this tool very useful, and it's updated regularly.

http://biodbnet.abcc.ncifcrf.gov/db/db2db.php

2
10.4 years ago

Why not just use BioMart? E.g. http://www.ensembl.org/biomart/martview/49b845b7dcb16ac15fcd8cd7d1461c6c gives you all RefSeq IDs and Ensembl transcript IDs in a tab-separated format. You can get the corresponding Perl code by clicking the "Perl" button.

0

I prefer Perl to R, but the Perl Biomart API is pretty rough. I wouldn't rely on it for more than very simple bulk downloads. Also, depending on the species, some biotypes (e.g. mRNA) do not have NCBI RefSeq annotations available to them.

0

That might be true. Honestly, I have never used the BioMart API myself: I prefer Python to Perl, so I always just write the XML and submit the query to BioMart directly. Or I simply download the full translation. There is no need to automate what only needs to be done once or twice. An example of automated queries can be found at Automating Database Searches.
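
For anyone who wants to try the XML route from Python, here is a rough sketch of what submitting a query can look like. Treat the details as assumptions rather than a reference: the martservice URL, the attribute name ensembl_transcript_id, and the helper names build_query and fetch_mapping are my own choices for illustration (refseq_mrna and the hsapiens_gene_ensembl dataset are the names used in the biomaRt answer). The XML that the martview page generates for your own query is the safer starting point.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
from xml.sax.saxutils import quoteattr

# Assumed endpoint; check the current Ensembl BioMart documentation.
MARTSERVICE_URL = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset, attributes):
    """Build a BioMart XML query asking for the given attributes as TSV."""
    attribute_tags = "".join(
        f"<Attribute name={quoteattr(name)} />" for name in attributes
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="0" '
        'uniqueRows="1" datasetConfigVersion="0.6">'
        f'<Dataset name={quoteattr(dataset)} interface="default">'
        f"{attribute_tags}"
        "</Dataset>"
        "</Query>"
    )


def fetch_mapping(dataset="hsapiens_gene_ensembl",
                  attributes=("ensembl_transcript_id", "refseq_mrna")):
    """Submit the query and return the tab-separated response as text.

    Needs network access; nothing calls this function at import time.
    """
    query = build_query(dataset, attributes)
    url = MARTSERVICE_URL + "?" + urlencode({"query": query})
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Print the XML only, so the sketch can be inspected offline.
    print(build_query("hsapiens_gene_ensembl",
                      ["ensembl_transcript_id", "refseq_mrna"]))
```

Transcripts without a RefSeq match would presumably come back with an empty second column, so filter those rows out before loading the result into a table.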

