Question

Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession

4.6 years ago

tom5 • 0

Hi, I hope you are doing well. I ran a BLAST alignment on a multi-gene FASTA file and kept the top hit for each gene as a RefSeq protein ID (such as NP_001229937.1). I want to convert these protein IDs to Entrez Gene accessions or Ensembl IDs. Is there a way to do so programmatically? I am working in R. I tried BioMart, but it returned no matches for some of the input RefSeq protein IDs.

gene R Entrez • 1.7k views

updated 15 months ago by Ram 44k • written 4.6 years ago by tom5 • 0

score 1 · Answer 1 · 2020-05-04

4.6 years ago

GenoMax 147k

Using EntrezDirect:

$ esearch -db protein -query "NP_001229937" | elink -target nuccore | efetch -format acc
NM_001243008.1
NC_000067.6

OR

$ esearch -db protein -query "NP_001229937" | elink -target gene | efetch -format ft

1. Col6a3
Official Symbol: Col6a3 and Name: collagen, type VI, alpha 3 [Mus musculus (house mouse)]
Other Aliases: AI507288, Col6a-3
Other Designations: collagen alpha-3(VI) chain; collagen alpha 3 chain type VI; collagen alpha3(VI); procollagen, type VI, alpha 3; type VI collagen alpha 3 subunit
Chromosome: 1; Location: 1 45.53 cM
Annotation: Chromosome 1 NC_000067.6 (90766860..90844001, complement)
ID: 12835

4.6 years ago by GenoMax 147k

Thank you! Is there a way to pass in a file with multiple RefSeq IDs at once? For example, instead of -query, could I use something like -input "file name"? I just need the gene symbol (and not the other information) for each RefSeq ID. My final goal is a table of gene symbols corresponding to the input file of RefSeq IDs.

4.6 years ago by tom5 • 0

Use something like this, where file.txt contains one accession per line:

cat file.txt | epost -db protein -format acc | elink -target nuccore | efetch -format acc

To get gene names:

$ esearch -db protein -query "NP_001229937" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name
Col6a3

4.6 years ago by GenoMax 147k

Hi, thank you for the quick reply! The second command you shared is exactly what I need, since it returns the gene symbol. However, I am not sure how to pass a file of multiple accessions (one per line) to this command. Could you explain how to do this?

4.6 years ago by tom5 • 0

cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name

file.txt should contain one accession per line.

4.6 years ago by GenoMax 147k
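The batch pipeline above prints gene symbols only, so the output does not show which input accession each symbol belongs to. A minimal sketch of one way to get a two-column table, assuming EDirect is installed and on the PATH, is to run the single-accession command from this thread once per line of the file. The function names lookup_symbol and accessions_to_table and the NA placeholder are made up for illustration.

```shell
#!/usr/bin/env bash

# Resolve one RefSeq protein accession to its gene symbol(s) with EDirect.
# stdin is redirected from /dev/null so esearch does not swallow the
# accession list that the calling loop is reading.
lookup_symbol() {
  esearch -db protein -query "$1" </dev/null |
    elink -target gene |
    esummary |
    xtract -pattern DocumentSummary -element Name
}

# Read accessions from stdin (one per line) and print
# "accession<TAB>symbol" for each; NA marks accessions with no hit.
accessions_to_table() {
  local acc sym
  while IFS= read -r acc; do
    [ -z "$acc" ] && continue                    # skip blank lines
    sym=$(lookup_symbol "$acc" | paste -sd, -)   # join multiple symbols with commas
    printf '%s\t%s\n' "$acc" "${sym:-NA}"
  done
}
```

Something like `accessions_to_table < file.txt > symbols.tsv` would then write the table. This makes one round of requests per accession, so it is slower than the epost approach and is probably best suited to short lists.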

Ram · Answer 2 · 2020-05-04

4.6 years ago

brianj.park ▴ 60

You can use org.Mm.eg.db.

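library(org.Mm.eg.db)  # mouse annotation package; also loads AnnotationDbi
Mm <- org.Mm.eg.db
# RefSeq protein accession, written without the version suffix (".1")
refseq_id <- "NP_001229937"
# Map the RefSeq key to Ensembl gene IDs; the AnnotationDbi:: prefix
# avoids clashes with other packages that define select(), such as dplyr
AnnotationDbi::select(Mm, keys = refseq_id, columns = c("REFSEQ", "ENSEMBL"), keytype = "REFSEQ")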

        REFSEQ            ENSEMBL
1 NP_001229937 ENSMUSG00000048126

updated 15 months ago by Ram 44k • written 4.6 years ago by brianj.park ▴ 60

Thanks! However, when I try the RefSeq ID "NP_033865.2", this returns "None of the keys entered are valid keys for 'REFSEQ'". I double-checked on NCBI and this is a valid gene entry. Please let me know if there's a way to resolve this issue. Thank you for your help!

4.6 years ago by tom5 • 0
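One possible cause, offered as an assumption since the thread does not resolve it: the REFSEQ keys in org.Mm.eg.db appear to be stored without the version suffix (the answer above queries NP_001229937, not NP_001229937.1), so NP_033865.2 may need to be trimmed to NP_033865 before the lookup. A small shell sketch that strips a trailing version from each accession in a file before the IDs are read into R; strip_version is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Remove a trailing ".<digits>" version suffix from each accession on stdin.
# Lines without a version suffix pass through unchanged.
strip_version() {
  sed -E 's/\.[0-9]+$//'
}
```

For example, `strip_version < file.txt > unversioned.txt` would write the trimmed accessions to a new file. Inside R, the same trimming can be done with sub("\\.[0-9]+$", "", ids) on a character vector of IDs.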