Question

Convert between RefSeq and Ensembl Transcript?

5

10.8 years ago

pwg46 ▴ 540

Hello, I am looking for a good data file to convert between RefSeq (specifically, accession identifiers such as NM_...) and Ensembl transcript IDs (ENST...). I am looking for a file that can be downloaded easily via a simple script using FTP. I would also prefer a text file in a sensible format, since I would essentially be parsing the ENST and NM_ parts and inserting them into a MySQL table.

refseq ensembl • 39k views

ADD COMMENT • link updated 3 months ago by GenoMax 151k • written 10.8 years ago by pwg46 ▴ 540

0

Are you looking to store ENS and NM ids for the same sequence? Or do you wanna store GenBank and ENSEMBL entries?

ADD REPLY • link 10.8 years ago by Ram 45k

0

Hmm, I guess the latter. I've been looking through the GenBank data files, and they have large data files for each chromosome. These files do have ENST -> NM_ mappings for every transcript on each chromosome, but I feel that using them would not be efficient. They are large and take a fairly long time to download, and parser scripts would take quite a while to run, even though I simply want to create a tab-delimited text file with the ENST ID in the first column and its corresponding NM_ ID in the second.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by pwg46 ▴ 540

0

I'd suggest tinkering with the UCSC Genome Browser's public MySQL database. You should be able to write a query or script that, given ID1, runs a few SELECTs to find ID2.

ADD REPLY • link 10.8 years ago by Ram 45k

Ram · Answer 1 · 2014-07-15

mysql -u anonymous -h ensembldb.ensembl.org

mysql> use homo_sapiens_core_75_37;

mysql> SELECT transcript.stable_id, xref.display_label
    -> FROM transcript, object_xref, xref, external_db
    -> WHERE transcript.transcript_id = object_xref.ensembl_id
    ->   AND object_xref.ensembl_object_type = 'Transcript'
    ->   AND object_xref.xref_id = xref.xref_id
    ->   AND xref.external_db_id = external_db.external_db_id
    ->   AND external_db.db_name = 'RefSeq_mRNA';

+-----------------+----------------+
| stable_id       | display_label  |
+-----------------+----------------+
| ENST00000517143 | NR_046932.1    |
| ENST00000362897 | NR_046944.1    |
| ENST00000384568 | NR_046948.1    |
| ENST00000384769 | NR_002574.1    |
| ENST00000384323 | NR_002575.1    |
| ENST00000516690 | NR_046934.1    |
| ENST00000363970 | NR_046947.1    |
| ENST00000365367 | NR_046928.1    |
| ENST00000517119 | NR_046935.1    |
| ENST00000427390 | NM_001145004.1 |
| ENST00000384474 | NR_046940.1    |
| ENST00000454856 | NM_001277303.1 |
| ENST00000516986 | NR_046930.1    |
| ENST00000559471 | NM_001193489.1 |
| ENST00000261847 | NM_014701.3    |
| ENST00000439682 | NM_001277304.1 |
| ENST00000439682 | NM_207355.2    |
| ENST00000411348 | NR_046931.1    |
| ENST00000298232 | NM_199259.2    |
| ENST00000361285 | NM_199261.2    |
| ENST00000342420 | NM_199260.2    |
| ENST00000569541 | NM_031421.2    |
| ENST00000299443 | NM_174981.3    |
| ENST00000399848 | NM_181482.4    |
| ENST00000359446 | NM_181481.4    |
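
The table above shows two things worth handling before loading the pairs into your own table: the RefSeq accessions carry version suffixes (.1, .2, ...), and one transcript (ENST00000439682) maps to more than one accession. Here is a minimal Python sketch of that clean-up. It assumes the query was run with the mysql client's --batch option so the output is tab-separated rather than drawn as a box; the function name parse_pairs and the sample rows (copied from the output above) are only for illustration.

```python
from collections import defaultdict


def parse_pairs(lines):
    """Turn tab-separated 'ENST<TAB>RefSeq' lines into {ENST: [accessions]}.

    Version suffixes (e.g. '.2') are stripped and duplicates are dropped.
    """
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header row printed by the client.
        if not line or line.startswith("stable_id"):
            continue
        enst, refseq = line.split("\t")[:2]
        accession = refseq.split(".")[0]  # NM_207355.2 -> NM_207355
        if accession not in mapping[enst]:
            mapping[enst].append(accession)
    return dict(mapping)


# Sample rows taken from the query output shown above.
sample = [
    "stable_id\tdisplay_label",
    "ENST00000439682\tNM_001277304.1",
    "ENST00000439682\tNM_207355.2",
    "ENST00000261847\tNM_014701.3",
]

mapping = parse_pairs(sample)

# Emit one ENST/RefSeq pair per line, ready for a bulk load.
for enst, accessions in sorted(mapping.items()):
    for accession in accessions:
        print(f"{enst}\t{accession}")
```

Whether to strip the version is a judgment call: keep it if you need to know exactly which RefSeq release a mapping refers to.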

Ram · Answer 2 · 2014-07-14

To query Ensembl BioMart programmatically, you can use the R package biomaRt:

library("biomaRt")
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

values <- c("NM_001101", "NM_001256799", "NM_000594")

getBM(attributes = c("refseq_mrna", "ensembl_gene_id", "hgnc_symbol"), filters = "refseq_mrna", values = values, mart = ensembl)

Results:

   refseq_mrna ensembl_gene_id hgnc_symbol
1    NM_000594 ENSG00000232810         TNF
2    NM_001101 ENSG00000075624        ACTB
3 NM_001256799 ENSG00000111640       GAPDH

The output of getBM() can be written to a file in tabular format using write.table().

To see which datasets are available in BioMart and which attributes and filters they have:

listDatasets(useMart("ensembl"))
listFilters(ensembl)
listAttributes(ensembl)

I would also look into the annotation packages maintained in Bioconductor, such as org.Hs.eg.db.

Ram · Answer 3 · 2015-12-08

5

9.4 years ago

ashbigdeli ▴ 50

I found this in another post, but I'll echo it here. For all ID conversions I have found this tool to be very useful, and it's updated regularly.

http://biodbnet.abcc.ncifcrf.gov/db/db2db.php

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by ashbigdeli ▴ 50

Ram · Answer 4 · 2014-07-14

2

10.8 years ago

David Westergaard ★ 1.5k

Why not just use BioMart? For example, http://www.ensembl.org/biomart/martview/49b845b7dcb16ac15fcd8cd7d1461c6c gives you all RefSeq IDs and Ensembl transcript IDs in a tab-separated format. You can get the equivalent Perl code by clicking the "Perl" button.

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by David Westergaard ★ 1.5k

0

I prefer Perl to R, but the Perl Biomart API is pretty rough. I wouldn't rely on it for more than very simple bulk downloads. Also, depending on the species, some biotypes (e.g. mRNA) do not have NCBI RefSeq annotations available to them.

ADD REPLY • link 10.8 years ago by pld 5.1k

0

That might be true. Honestly, I have never used the BioMart API myself, as I prefer Python to Perl. I always take the XML and submit it as a query to BioMart, or simply download the full translation table. There is no need to automate what only needs to be done once or twice. An example of automated queries can be found at Automating Database Searches.
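
For anyone who wants to try the XML route from Python, here is a rough sketch using only the standard library. It builds the kind of query document that the "XML" button in BioMart produces and wraps it in a martservice URL. The endpoint URL, the dataset name and the two attribute names (ensembl_transcript_id, refseq_mrna) are my assumptions about the current Ensembl mart, so check them against the XML that BioMart gives you for your own query. The helper names build_query, build_url and fetch_pairs are made up for this example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint; confirm against the Ensembl BioMart help pages.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset="hsapiens_gene_ensembl",
                attributes=("ensembl_transcript_id", "refseq_mrna")):
    """Build a BioMart XML query asking for the given attributes as TSV."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


def build_url(xml_query):
    """URL-encode the XML into the 'query' parameter of the service."""
    return MARTSERVICE + "?" + urllib.parse.urlencode({"query": xml_query})


def fetch_pairs(url):
    """Yield (ENST, RefSeq) pairs, skipping rows with no RefSeq ID.

    This performs a network request, so it is not called below.
    """
    with urllib.request.urlopen(url) as response:
        for raw in response:
            fields = raw.decode("utf-8").rstrip("\r\n").split("\t")
            if len(fields) == 2 and fields[1]:
                yield fields[0], fields[1]


xml_query = build_query()
url = build_url(xml_query)
print(xml_query)
```

To actually download the table you would loop over fetch_pairs(url) and write each pair to a file. Many transcripts have no RefSeq match, which is why empty second columns are skipped.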

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by David Westergaard ★ 1.5k

0

I cannot find a way to get the results as RefSeq IDs, even when I select that attribute. Could you guide me through the necessary steps, please?

ADD REPLY • link 4 months ago by mauricio.1313 • 0

Ram · Answer 5 · 2025-01-28

1

3 months ago

Zhenyu Zhang ★ 1.3k

I deal with gene name mapping quite often.

I usually build my own mapping table and save it as a TSV, so that I can reuse it for different purposes.

Unfortunately it's in a private repo, but the general idea is to download mappings from the following resources and combine them with the gene names you can find in the GENCODE GTFs:

NCBI: ncbi <- fread("ftp://ftp.ncbi.nlm.nih.gov/gene/DATA//GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz")
org.Hs.eg.db (bitr() is from the clusterProfiler package): org.hs.mapping <- bitr(gencode$gene_name, fromType="SYMBOL", toType=c("ALIAS", "ENSEMBL"), OrgDb="org.Hs.eg.db")
UCSC: kgalias <- fread("https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/kgAlias.txt.gz", header=FALSE)

All of these resources are reliable except UCSC, because it is a transcript-level mapping, which sometimes gives you false positives (FPs). What you can do is set your own rule according to your preference: either map as many names as possible and accept some FPs, or keep only the accurate mappings. I have found that this approach consistently works better than a BioMart download.
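
To make the "set your own rule" idea concrete, here is a small Python sketch (not the code from my private repo). It assumes each resource has already been reduced to a set of (symbol, Ensembl ID) pairs, and it uses the number of resources supporting a pair as the rule: a threshold of 1 keeps everything, a higher threshold keeps only pairs that several resources agree on. The function name combine_sources, the min_support parameter and the toy data are all made up for illustration; the one deliberately wrong pair stands in for a UCSC false positive.

```python
from collections import defaultdict


def combine_sources(sources, min_support=1):
    """Merge mapping pairs from several resources.

    sources: {resource_name: set of (symbol, ensembl_id) pairs}
    Returns {pair: sorted list of resources that contain it}, keeping
    only pairs found in at least min_support resources.
    """
    support = defaultdict(set)
    for source_name, pairs in sources.items():
        for pair in pairs:
            support[pair].add(source_name)
    return {
        pair: sorted(names)
        for pair, names in support.items()
        if len(names) >= min_support
    }


# Toy data: the last UCSC pair is an invented false positive.
sources = {
    "ncbi": {("TNF", "ENSG00000232810"), ("ACTB", "ENSG00000075624"),
             ("GAPDH", "ENSG00000111640")},
    "orgdb": {("TNF", "ENSG00000232810"), ("ACTB", "ENSG00000075624"),
              ("GAPDH", "ENSG00000111640")},
    "ucsc": {("TNF", "ENSG00000232810"), ("ACTB", "ENSG_FAKE_0001")},
}

permissive = combine_sources(sources, min_support=1)  # keep everything
strict = combine_sources(sources, min_support=2)      # require agreement

for (symbol, ensembl_id), names in sorted(strict.items()):
    print(f"{symbol}\t{ensembl_id}\t{','.join(names)}")
```

Keeping the list of supporting resources next to each pair makes it easy to audit later why a mapping was accepted.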

ADD COMMENT • link updated 3 months ago by Ram 45k • written 3 months ago by Zhenyu Zhang ★ 1.3k

0

This is really helpful. Please also mention "biomaRt", as covered in a different answer. Your answer would then be the most comprehensive one: a one-stop answer that saves people from reading the others.

ADD REPLY • link 3 months ago by Ram 45k

0

What if you are not working with humans? Not all gene mapping is between human and another species. Sometimes another species is phylogenetically closer, and mapping against it works better than mapping against human. This example, like most of the ones I have found on the internet, assumes a human versus non-human comparison, which is not always the case. Databases like org.Hs.eg.db are not useful in those cases.

Please feel free to correct me if I'm wrong, or to provide a good example of gene mapping between two non-human species.

ADD REPLY • link 3 months ago by mauricio.1313 • 0

1

This thread and its answers are not about mapping transcripts between species. They are about cross-mapping RefSeq and Ensembl transcript IDs within the same species.

ADD REPLY • link 3 months ago by GenoMax 151k