Convert between RefSeq and Ensembl Transcript?
10.6 years ago
pwg46 ▴ 540

Hello, I am looking for a good data file to convert between RefSeq (specifically, accession identifiers such as NM_...) and Ensembl transcript IDs (ENST...). I am looking for a file which can be downloaded easily via a simple script using FTP. I would also prefer a text file with a reasonably regular format, as I would essentially be parsing the ENST and NM_ parts and inserting them into a MySQL table.

refseq ensembl • 38k views

Are you looking to store ENS and NM ids for the same sequence? Or do you wanna store GenBank and ENSEMBL entries?


Hmm, I guess the latter. I've been looking through the GenBank data files, and they have large data files for each chromosome. These files do have ENST -> NM_ mappings for every transcript on each chromosome, but I feel that using them would not be efficient. They are large and take a fairly long time to download, and parser scripts would take quite a while to run, even though I simply want to create a tab-delimited text file with the ENST ID in the first column and its corresponding NM_ ID in the second column.


I'd suggest tinkering with UCSC Genome Browser's mysql database. You should be able to write a query/script that, given ID1, does a bunch of SELECTs for ID2.

10.6 years ago
Bert Overduin ★ 3.7k
mysql -u anonymous -h ensembldb.ensembl.org

mysql> use homo_sapiens_core_75_37;

mysql> SELECT transcript.stable_id, xref.display_label
>       FROM transcript, object_xref, xref, external_db
>       WHERE transcript.transcript_id = object_xref.ensembl_id
>       AND object_xref.ensembl_object_type = 'Transcript'
>       AND object_xref.xref_id = xref.xref_id
>       AND xref.external_db_id = external_db.external_db_id
>       AND external_db.db_name = 'RefSeq_mRNA';

+-----------------+----------------+
| stable_id       | display_label  |
+-----------------+----------------+
| ENST00000517143 | NR_046932.1    |
| ENST00000362897 | NR_046944.1    |
| ENST00000384568 | NR_046948.1    |
| ENST00000384769 | NR_002574.1    |
| ENST00000384323 | NR_002575.1    |
| ENST00000516690 | NR_046934.1    |
| ENST00000363970 | NR_046947.1    |
| ENST00000365367 | NR_046928.1    |
| ENST00000517119 | NR_046935.1    |
| ENST00000427390 | NM_001145004.1 |
| ENST00000384474 | NR_046940.1    |
| ENST00000454856 | NM_001277303.1 |
| ENST00000516986 | NR_046930.1    |
| ENST00000559471 | NM_001193489.1 |
| ENST00000261847 | NM_014701.3    |
| ENST00000439682 | NM_001277304.1 |
| ENST00000439682 | NM_207355.2    |
| ENST00000411348 | NR_046931.1    |
| ENST00000298232 | NM_199259.2    |
| ENST00000361285 | NM_199261.2    |
| ENST00000342420 | NM_199260.2    |
| ENST00000569541 | NM_031421.2    |
| ENST00000299443 | NM_174981.3    |
| ENST00000399848 | NM_181482.4    |
| ENST00000359446 | NM_181481.4    |
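A listing like this can be turned into the two-column file the question asks for. The following is a minimal Python sketch, assuming the query was run in batch mode (`mysql -B -e "..."`, which prints tab-separated rows with a header line) and saved as text. The function names and the choice to drop the `.N` version suffix are illustrative assumptions, not part of the original answer.

```python
import csv
import io
from collections import defaultdict


def parse_pairs(handle):
    """Yield (ENST, RefSeq) pairs from tab-separated mysql batch output."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2 or row[0] == "stable_id":
            continue  # skip the header line and malformed rows
        enst, refseq = row[0].strip(), row[1].strip()
        # Assumption: the target table stores unversioned accessions,
        # so NM_014701.3 becomes NM_014701.
        yield enst, refseq.split(".")[0]


def group_by_transcript(pairs):
    """Collect accessions per transcript; one ENST can map to several."""
    mapping = defaultdict(list)
    for enst, refseq in pairs:
        if refseq not in mapping[enst]:
            mapping[enst].append(refseq)
    return dict(mapping)


# Three rows from the listing, written as they would appear in batch mode.
sample = (
    "stable_id\tdisplay_label\n"
    "ENST00000261847\tNM_014701.3\n"
    "ENST00000439682\tNM_001277304.1\n"
    "ENST00000439682\tNM_207355.2\n"
)
mapping = group_by_transcript(parse_pairs(io.StringIO(sample)))
for enst, accessions in mapping.items():
    print(enst, ",".join(accessions), sep="\t")
```

Note that ENST00000439682 appears twice in the listing, so a one-row-per-pair table (rather than a unique key on the ENST column) is probably the safer schema.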

Or if you want GenBank (EMBL) accession numbers instead:

mysql> SELECT transcript.stable_id, xref.display_label 
>       FROM translation, transcript, object_xref, xref,external_db 
>       WHERE transcript.transcript_id = translation.transcript_id 
>       AND translation.translation_id = object_xref.ensembl_id 
>       AND object_xref.ensembl_object_type = 'Translation' 
>       AND object_xref.xref_id = xref.xref_id 
>       AND xref.external_db_id = external_db.external_db_id 
>       AND external_db.db_name = 'EMBL';
10.6 years ago

This is to query ensembl/biomart programmatically via the R library biomaRt:

library("biomaRt")
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

values <- c("NM_001101", "NM_001256799", "NM_000594")

getBM(attributes = c("refseq_mrna", "ensembl_gene_id", "hgnc_symbol"), filters = "refseq_mrna", values = values, mart = ensembl)

Results:

   refseq_mrna ensembl_gene_id hgnc_symbol
1    NM_000594 ENSG00000232810         TNF
2    NM_001101 ENSG00000075624        ACTB
3 NM_001256799 ENSG00000111640       GAPDH

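The output of getBM() can be written to a file in tabular format using write.table(). To get transcript-level IDs rather than gene IDs, add "ensembl_transcript_id" to the attributes vector.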

To know which datasets are in biomart and what attributes and filters they have:

listDatasets(useMart("ensembl"))
listFilters(ensembl)
listAttributes(ensembl)

I would also look into the annotation packages maintained in Bioconductor, such as org.Hs.eg.db.

9.2 years ago
ashbigdeli ▴ 50

I found this in another post, but I'll echo it here. For all ID conversions I have found this tool very useful, and it's updated regularly.

http://biodbnet.abcc.ncifcrf.gov/db/db2db.php

10.6 years ago

Why not just use Biomart? For example, http://www.ensembl.org/biomart/martview/49b845b7dcb16ac15fcd8cd7d1461c6c gives you all RefSeq IDs and Ensembl transcript IDs in a tab-separated format. You can get the Perl code by clicking the "Perl" button.


I prefer Perl to R, but the Perl Biomart API is pretty rough. I wouldn't rely on it for more than very simple bulk downloads. Also, depending on the species, some biotypes (e.g. mRNA) do not have NCBI RefSeq annotations available to them.


That might be true. Honestly, I have never used the Biomart API myself. As I prefer Python to Perl, I always take the XML and submit the query to Biomart directly, or simply download the full translation table. There is no need to automate what only needs to be done once or twice. An example of automated queries can be found at Automating Database Searches.
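The XML route mentioned here could look roughly like the following. This is a minimal Python sketch, not the commenter's own script: the endpoint URL and the helper names are assumptions, `refseq_mrna` is the attribute name used in the biomaRt answer, and `ensembl_transcript_id` is assumed to be the matching transcript attribute. The network call is left commented out.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: the public Ensembl BioMart service endpoint.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset, attributes):
    """Return a BioMart XML query string that asks for headerless TSV."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def build_url(xml_query):
    """URL-encode the XML into the service's 'query' parameter."""
    return MARTSERVICE + "?" + urllib.parse.urlencode({"query": xml_query})


def parse_tsv(text):
    """Keep only rows where both the ENST and the RefSeq ID are present."""
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 2 and all(fields):
            pairs.append((fields[0], fields[1]))
    return pairs


xml_query = build_query("hsapiens_gene_ensembl",
                        ["ensembl_transcript_id", "refseq_mrna"])
url = build_url(xml_query)
# To run the query for real (needs network access):
#   import urllib.request
#   text = urllib.request.urlopen(url).read().decode()
#   pairs = parse_tsv(text)
```

Transcripts without a RefSeq match come back with an empty second column, which is why `parse_tsv` filters them out.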


I cannot find a way to get the results as RefSeq IDs, even if I select that information. Could you guide me through the necessary steps, please?

20 days ago
Zhenyu Zhang ★ 1.2k

I deal with gene name mapping quite often.

I usually build my own mapping table and save it as a TSV, so that I can reuse it for different purposes.

Unfortunately it's in a private repo, but the general idea is to download mappings from the following resources and combine them with the gene names you can find in the GENCODE GTFs (fread() is from data.table; bitr() is from clusterProfiler):

  1. NCBI: ncbi <- fread("ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz")
  2. org.Hs.eg.db: org.hs.mapping <- bitr(gencode$gene_name, fromType="SYMBOL", toType=c("ALIAS", "ENSEMBL"), OrgDb="org.Hs.eg.db")
  3. UCSC: kgalias <- fread("https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/kgAlias.txt.gz", header=FALSE)

All these resources are reliable except UCSC, because it is a transcript-level mapping, which sometimes gives you false positives (FPs). You can set your own rule according to your preference: either map as many names as possible and expect some FPs, or keep only the accurate mappings. I have found that this approach consistently works better than a Biomart download.
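The strict versus permissive rule could be sketched as follows. This is a minimal Python illustration, not the private-repo code: the `combine` function, the source labels, and the assignment of pairs to sources are hypothetical, and each source is assumed to have already been reduced to (symbol, Ensembl gene ID) pairs.

```python
from collections import defaultdict

# Assumption: sources treated as reliable; "ucsc" is deliberately left out.
TRUSTED = {"ncbi", "orgdb"}


def combine(sources, strict=True):
    """Merge (symbol, ensembl_id) pairs from several named sources.

    strict=True  keeps a pair only if at least one trusted source reports it.
    strict=False keeps every pair, accepting possible false positives.
    Returns {pair: sorted list of supporting source names}.
    """
    support = defaultdict(set)
    for name, pairs in sources.items():
        for pair in pairs:
            support[pair].add(name)
    kept = {}
    for pair, names in support.items():
        if not strict or names & TRUSTED:
            kept[pair] = sorted(names)
    return kept


# Toy input: which source reports which pair is made up for illustration.
sources = {
    "ncbi": [("TNF", "ENSG00000232810"), ("ACTB", "ENSG00000075624")],
    "orgdb": [("TNF", "ENSG00000232810")],
    "ucsc": [("ACTB", "ENSG00000075624"), ("GAPDH", "ENSG00000111640")],
}
accurate = combine(sources, strict=True)
permissive = combine(sources, strict=False)
print(len(accurate), "pairs kept in strict mode")
print(len(permissive), "pairs kept in permissive mode")
```

Keeping the list of supporting sources alongside each pair makes it easy to tighten or loosen the rule later without rebuilding the table.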


This is really helpful. Please also mention "biomaRt" as mentioned in a different comment, then your answer will serve as the most comprehensive answer - one that anyone can look at as a one stop answer and stop reading other answers.


What if you are not working with humans? Not all gene mapping is between human and another species. Sometimes another species is phylogenetically closer and gives a better mapping than human would. This example, and most of the ones I have found on the internet, assume a human versus non-human comparison, which is not always the case. Databases like org.Hs.eg.db are not useful in those cases.

Please feel free to correct me if I'm wrong, or to provide a good example of gene mapping between two non-human species.


This thread and the answer are not about mapping transcripts between species. They are about cross-mapping RefSeq and Ensembl transcript IDs within the same species.
