I'd like to find the proteins encoded by my refGene transcripts.
I have the refGene.txt file where each line looks like the following:
138 NM_016166 chr15 + 68346571 68480404 68346664 68480173 14 68346571,68378643,68434283,68434627,68438153,68438903,68445927,68457068,68466069,68467974,68468811,68473549,68475967,68479879, 68346688,68379088,68434368,68434675,68438244,68439038,68446033,68457142,68466230,68468105,68468992,68473692,68476005,68480404, 0 PIAS1 cmpl cmpl 0,0,1,2,2,0,0,1,0,2,1,2,1,0,
I'd like to know how I can get the NP_XXXX name for the transcript NM_016166.
Preferably, I'd like to just get a text file mapping the two in some way, but I'll accept any answer that does this automatically, e.g. with BioPython (having to look it up by hand in a browser or some such doesn't cut it - I need to do this for 50K transcripts).
I'm working with hg38, but I'm guessing the procedure is the same for all major genome versions, so I did not specify one, to keep the question as general as possible.
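To make the desired output concrete, here is a minimal sketch of the kind of text-file mapping described above. It assumes a second, tab-separated link table is available that lists an mRNA accession and its protein accession on each row, for example UCSC's refLink table. That table, its column positions (`MRNA_COL`, `PROT_COL`), and the file names in the usage comment are assumptions, not something taken from refGene.txt itself, so check them against the table's schema first. The function names are hypothetical.

```python
# Assumed 0-based column positions in a refLink-style table (an assumption;
# verify against the actual schema of whatever link table you download).
MRNA_COL = 2
PROT_COL = 3


def refgene_transcripts(lines):
    """Yield the transcript accession (second field) of each refGene line."""
    for line in lines:
        fields = line.split()
        if len(fields) > 1:
            yield fields[1]


def mrna_to_protein(link_lines, mrna_col=MRNA_COL, prot_col=PROT_COL):
    """Build {mRNA accession: protein accession} from tab-separated lines."""
    mapping = {}
    for line in link_lines:
        row = line.rstrip("\n").split("\t")
        # Skip short rows and rows without a protein (e.g. non-coding RNAs).
        if len(row) > max(mrna_col, prot_col) and row[prot_col]:
            mapping[row[mrna_col]] = row[prot_col]
    return mapping


def write_mapping(transcripts, mapping, out):
    """Write one 'NM<TAB>NP' line per transcript; 'NA' if no protein is known."""
    for nm in transcripts:
        out.write(f"{nm}\t{mapping.get(nm, 'NA')}\n")


# Usage (file names are placeholders):
# with open("refGene.txt") as genes, open("refLink.txt") as links, \
#         open("nm_to_np.tsv", "w") as out:
#     write_mapping(refgene_transcripts(genes), mrna_to_protein(links), out)
```

Building the dictionary once and looking each transcript up in it keeps the whole job to a single pass over each file, which should be comfortable for 50K transcripts.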
Thanks, I need to install some tools so will get back to you. Upvote.
P.S. Other answers are still welcome. Is there a file that contains these mappings somewhere?
Your previous, non-while version was better. This one includes an error (and only reads the first line anyway; see this post on StackOverflow).
My bad. I edited the answer again.