Mapping refseq transcripts to encoded proteins (NM_ to NP_)
3
0
Entering edit mode
10.3 years ago

I'd like to find the protein encoded by my refgene transcripts.

I have the refGene.txt file where each line looks like the following:

138    NM_016166    chr15    +    68346571    68480404    68346664    68480173    14    68346571,68378643,68434283,68434627,68438153,68438903,68445927,68457068,68466069,68467974,68468811,68473549,68475967,68479879,    68346688,68379088,68434368,68434675,68438244,68439038,68446033,68457142,68466230,68468105,68468992,68473692,68476005,68480404,    0    PIAS1    cmpl    cmpl    0,0,1,2,2,0,0,1,0,2,1,2,1,0,

I'd like to know how I can get the NP_XXXX name for the transcript NM_016166.

Preferably, I'd like to just get a text file mapping the two in some way, but I'll accept any answer that does this automatically, e.g. with BioPython (having to look it up by hand in a browser or some such doesn't cut it - I need to do this for 50K transcripts).

I'm working with hg38, but I'm guessing the procedure is the same for all major genome versions, so I did not specify to make the Q as general as possible.

refseq • 5.3k views
ADD COMMENT
2
Entering edit mode
10.3 years ago
5heikki 11k

Here's one way with Entrez Direct:

IFS=$'\n'; for next in $(cut -f2 refGene.txt); do NP=$(esearch -db nuccore -query $next | elink -target protein | efetch -format docsum | xtract -element Caption); echo "$next $NP"; done
ADD COMMENT
0
Entering edit mode

Thanks, I need to install some tools so will get back to you. Upvote.

Ps. other answers still welcome. Is there a file that contains these mappings somewhere?

ADD REPLY
0
Entering edit mode

Your previous, non-while version was better. This one includes an error (and only reads the first line anyways, see this post on StackOverflow.

ADD REPLY
0
Entering edit mode

My, bad. I edited the answer again.

ADD REPLY
2
Entering edit mode
9.3 years ago

Hello myself from the past! After having done similar tasks umpteen gazillion times you realized you should make them easier for yourself. Therefore you wrote biomartian (pip install biomartian:

echo "138 NM_016166 chr15 + 68346571 68480404 68346664 68480173 14" | sed 's/ /\t/g' | biomartian --noheader -c 1 -i refseq_mrna -o refseq_peptide -
0    1    2    3    4    5    6    7    8    refseq_peptide
138    NM_016166    chr15    +    68346571    68480404    68346664    68480173    14    NP_057250

(The code before the biomartian call is just used to replace spaces with tabs).

See https://github.com/endrebak/biomartian for more

ADD COMMENT
0
Entering edit mode

this example does not work in my hands. The refseq_peptide returns blank.

ADD REPLY
1
Entering edit mode
10.3 years ago

The UCSC table to map transcripts to proteins is called refLink.

You can get it as a plain-text file by going to the UCSC table browser, selecting your genome of interest and then setting the group to "all tables". Next choose the table refLink, enter a filename and then press the "get output"-button.

Example output:

#name product mrnaAcc protAcc geneName prodName locusLinkId omimId
PAX2 paired box 2 NM_001282819 NP_001269748 133420 110035 102094402 0
MLPH melanophilin NM_001282836 NP_001269765 65090 92499 102093006 0
ADD COMMENT
0
Entering edit mode

Great! Your answer help me a lot!

ADD REPLY

Login before adding your answer.

Traffic: 1985 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6