Question

NCBI nucleotide ID to protein ID

0

5.6 years ago

mxlsherry1992 ▴ 80

Dear all, I have about 4,000 NCBI nucleotide IDs and want to download the corresponding protein sequences from NCBI. I know FASTA files can be downloaded in bulk with NCBI Batch Entrez, but I cannot download protein sequences using nucleotide IDs unless I do it one by one, which is impractical for 4,000 sequences.

So is there a way to convert NCBI nucleotide IDs to protein IDs, or some other solution? A nucleotide ID looks like this: XM_017496492.1, and its corresponding protein ID is: XP_017351981.1

Any advice will be greatly appreciated!

RNA-Seq Assembly genome sequence • 3.6k views

ADD COMMENT • link updated 3.2 years ago by Friederike 9.0k • written 5.6 years ago by mxlsherry1992 ▴ 80

0

Interesting question.

Would simply BLASTing the nucleotide sequences you are interested in against nrprot (or a subset of it, if the required IDs are from a single species or some other taxonomic subset) be an option?

ADD REPLY • link 5.6 years ago by lieven.sterck 15k

0

Hi, thanks for the suggestion! I didn't explain it clearly. This is actually RNA-seq data with a reference genome. I have found some genes of interest and want to get their protein IDs and download the protein sequences, which I will then use as input for OrthoFinder. So in this case, maybe converting the IDs is much easier?

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80

0

OK, different issue indeed. I will leave it up to others to chip in here.

One remark though: running tools like OrthoFinder on a subset of proteins will technically work but might (will?) bias the results. It's advisable to run those tools with as complete a set of proteins as you can.

EDIT: et voilà, GenoMax has already provided a solution for this.

ADD REPLY • link 5.6 years ago by lieven.sterck 15k

score 2 · Answer 1 · 2019-05-06

2

5.6 years ago

GenoMax 147k

Using EntrezDirect:

$ esearch -db nuccore -query "XM_017496492.1" | elink -target protein | efetch -db protein -format acc
XP_017351981.1

To get the actual sequence:

$ esearch -db nuccore -query "XM_017496492.1" | elink -target protein | efetch -db protein -format fasta
>XP_017351981.1 PREDICTED: protein FAM19A1-like isoform X1 [Ictalurus punctatus]
MSWFLCLWIAVSCLVLCQATLYETIQQHHVPRPGRNAIQILEGGTCEVIAAHRCCNKNRIEERSQTVKCS
CLPGKVAGTTRNKPSCVDASIVIGKWWCEMEPCLEGEECKTLPDNSGWMCSSGNKIKTTRVRTSRPTHTI
YTHHTHTHTHTHIQTYSQ

ADD COMMENT • link 5.6 years ago by GenoMax 147k

0

I knew this was the way, but you can't beat GenoMax's speed.

To add something useful (in case you're up for a quick and dirty non-informatics solution): use an Excel sheet with your IDs in column A and GenoMax's query in column B, with the ID from column A as the variable, then copy it to all rows. Then copy/paste the result into a shell script, taking care to use the correct line endings (EOL). Done. It's not pure, but it gets you there.

ADD REPLY • link 5.6 years ago by Carambakaracho ★ 3.3k
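The spreadsheet trick can also be done directly in the shell, which takes care of the line-ending problem at the same time. A minimal sketch, assuming one accession per line; `ids.txt` and `commands.sh` are hypothetical file names, and the sample input deliberately uses Windows (CRLF) line endings to show them being stripped:

```shell
#!/usr/bin/env bash
# Sample input with CRLF line endings, as a spreadsheet export might produce.
# ids.txt and commands.sh are hypothetical file names.
printf 'XM_017496492.1\r\nXM_017496642.1\r\n' > ids.txt

# Strip carriage returns, skip empty lines, and wrap each ID in the EDirect pipeline.
tr -d '\r' < ids.txt |
  awk 'NF { printf "esearch -db nuccore -query \"%s\" | elink -target protein | efetch -format fasta >> protein_out\n", $1 }' > commands.sh

# Inspect the generated commands before running them (bash commands.sh).
cat commands.sh
```

This only writes the command lines; nothing is sent to NCBI until `commands.sh` is actually run.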

0

Hi, I got it! But would it be possible to use a text file or something similar as the input? If not, I will use Excel to create the command lines and paste them into a shell script like you said :)

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80

0

It worked! I tried this command:

esearch -db nuccore -query "XM_017496492.1" | elink -target protein | efetch -db protein -format fasta

But do you think I can use it to download multiple sequences, like this?

esearch -db nuccore -query nucl_id | elink -target protein | efetch -db protein -format fasta > protein_out

Here nucl_id is the input file containing all the nucleotide IDs, and the output goes to a file called protein_out.

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80

1

Something like this should work:

$ cat nucl_id | epost -db nuccore -format acc | elink -target protein | efetch -format fasta >> protein_out

Make sure you sign up for an NCBI API key first.

ADD REPLY • link 5.6 years ago by GenoMax 147k

1

Or something like this:

#!/usr/bin/env bash
while read -r nucl_id
do
    # EDirect commands may read stdin; </dev/null stops esearch from swallowing the remaining IDs.
    esearch -db nuccore -query "${nucl_id}" </dev/null | elink -target protein | efetch -db protein -format fasta >> protein_out
    sleep 1 # prevents you from getting banned
done < "$1"

Then call it like this, assuming one ID per line:

bash script.sh idfile.txt

ADD REPLY • link 5.6 years ago by Carambakaracho ★ 3.3k

0

I will try it! Thank you!

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80

0

Good afternoon, I tried your code (cat nucl_id | epost -db nuccore -format acc | elink -target protein | efetch -format fasta >> protein_out) and it worked! But I have one more question: is it possible to get output that includes both the protein ID and the nucleotide ID? The output looks like this, containing only the protein ID:

>NP_001232866.1 creatine kinase M-type [Ictalurus punctatus]
MTKNCNNDYKMKFPMEEEYPDLSLHNNHMSKVLTKDIYNKLRGKSTPSGFTLDDCIQTGVDNPGHPFIMT
VGCVAGDEESYEVFKDLFDPIISDRHSGYKPTDKHHTDLNWENLKGGDDLDPNYVVSSRVRTGRSIKGFT
LPPTNSRGERRAVEKLSIEALTSLDGEFKGKYYPLKDMTDKEQEQLIADHFLFDKPVSPLLLAAGMARDW
PDARGIWHNDNKTFLVWVNEEDHLRVISMQKGGNMKEVFKRFCVGLQKIEEIFKKHNHGFMWNEHLGFVL
TCPSNLGTGLRGGVHVKLPKLSTHPKFEEILTRLRLQKRGTGGVDTASVGGVFDISNADRLGSSEVQQVQ
LVVDGVKLMVEMEKKLEKGESIDDMIPAQK

Thanks for your generous help!

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80

0

I assume that the order of the nucl_id list will be the same as the order in the protein_out file so you should be able to simply link them.

ADD REPLY • link 5.6 years ago by lieven.sterck 15k

0

Thank you for the reply! But I just found that the order is not exactly the same. It may be because some of my input IDs don't have a matching protein sequence. Another problem is that if we search for "XP_017352162.1" directly at NCBI, the protein ID starts with "XP", but some proteins downloaded with the code start with "NP" (NP_001232866.1, for example). I also find the gene ID (108281101, for example) more reliable, so would it be possible to include the gene ID in the protein output file?

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80

0

Yeah, fair enough, I didn't think about that...

Waiting for some of GenoMax's magic then ;)

ADD REPLY • link 5.6 years ago by lieven.sterck 15k

0

Sadly no magic. With NCBI Entrez, once an ID is transformed into a new one there is no way to keep track of the original (AFAIK).

mxlsherry1992: I suggest that you use the sequence and annotation files available from the genome page to link these two IDs.

ADD REPLY • link 5.6 years ago by GenoMax 147k

0

Hi, I downloaded the channel catfish GFF file from NCBI. It has a nucleotide ID column with values like "XM_017496642.1" and an NCBI ID column with values like "108281084", but no column containing the protein ID for me to link. I also didn't find a file containing both the nucleotide ID and the protein ID that I could use to link and extract them. Please let me know if I missed anything, it will be really appreciated! Thank you and have a great night!

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80
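In NCBI RefSeq GFF3 files the protein accession normally sits inside the attributes (ninth) column of the CDS rows as `protein_id=...`, rather than in a column of its own, and the matching mRNA row carries `transcript_id=...`. A minimal sketch of linking the two, assuming that layout and assuming each mRNA row appears before its CDS rows; the toy file below is invented for illustration (only the first XM_/XP_ pair comes from this thread), and `toy.gff`, `toy_ids.txt` and `id_map.tsv` are hypothetical file names. Check the real file first, for example with `grep -m1 protein_id`.

```shell
#!/usr/bin/env bash
# Toy GFF3 in the style of an NCBI RefSeq annotation. Coordinates, GeneID values
# and the second transcript are invented; only the first XM_/XP_ pair is from the thread.
{
  printf 'chrToy\tGnomon\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0;Dbxref=GeneID:1001,Genbank:XM_017496492.1;transcript_id=XM_017496492.1\n'
  printf 'chrToy\tGnomon\tCDS\t150\t400\t.\t+\t0\tID=cds0;Parent=rna0;Dbxref=GeneID:1001,Genbank:XP_017351981.1;protein_id=XP_017351981.1\n'
  printf 'chrToy\tGnomon\tCDS\t500\t800\t.\t+\t1\tID=cds0;Parent=rna0;Dbxref=GeneID:1001,Genbank:XP_017351981.1;protein_id=XP_017351981.1\n'
  printf 'chrToy\tGnomon\tmRNA\t2000\t2900\t.\t-\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:1002,Genbank:XM_000000002.1;transcript_id=XM_000000002.1\n'
  printf 'chrToy\tGnomon\tCDS\t2100\t2700\t.\t-\t0\tID=cds1;Parent=rna1;Dbxref=GeneID:1002,Genbank:XP_000000002.1;protein_id=XP_000000002.1\n'
} > toy.gff

# Nucleotide IDs of interest; the second one has no entry in the toy annotation.
printf 'XM_017496492.1\nXM_017496642.1\nXM_000000002.1\n' > toy_ids.txt

awk -F'\t' -v OFS='\t' '
  # Return the value of attribute "key" from a GFF3 column-9 string, or "".
  function attr(s, key,    n, i, parts) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++)
          if (index(parts[i], key "=") == 1)
              return substr(parts[i], length(key) + 2)
      return ""
  }
  # First file: the ID list. Remember the input order.
  NR == FNR { if ($1 != "") order[++n_ids] = $1; next }
  /^#/ { next }
  # mRNA rows: map the feature ID (e.g. rna0) to its transcript accession.
  $3 == "mRNA" { tx[attr($9, "ID")] = attr($9, "transcript_id") }
  # CDS rows: follow Parent back to the transcript and record protein and gene IDs.
  $3 == "CDS" {
      t = tx[attr($9, "Parent")]
      if (t != "") {
          prot[t] = attr($9, "protein_id")
          if (match($9, /GeneID:[0-9]+/))
              gene[t] = substr($9, RSTART + 7, RLENGTH - 7)
      }
  }
  END {
      # One row per input ID, in input order; NA marks IDs without a protein.
      for (i = 1; i <= n_ids; i++) {
          id = order[i]
          p = (id in prot) ? prot[id] : "NA"
          g = (id in gene) ? gene[id] : "NA"
          print id, p, g
      }
  }
' toy_ids.txt toy.gff > id_map.tsv

cat id_map.tsv
```

The output has one row per input ID in the original order, so IDs without a matching protein show up as NA instead of silently shifting the order.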

0

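#!/usr/bin/env bash
# Usage: bash script.sh idfile.txt (one nucleotide ID per line)
while read -r nucl_id
do
    # Marker line carrying the nucleotide ID; it is merged into the FASTA header below.
    echo "###${nucl_id}" >> protein_out
    # EDirect commands may read stdin; </dev/null stops esearch from swallowing the remaining IDs.
    esearch -db nuccore -query "${nucl_id}" </dev/null | elink -target protein | efetch -db protein -format fasta >> protein_out
done < "$1"
# sed works line by line, so 's/\n>/ /' never matches; merge marker and header with awk instead.
awk '/^###/ { id = substr($0, 4); next }
     /^>/   { print ">" id " " substr($0, 2); next }
     { print }' protein_out > protein_out_mod

Slowly, this is starting to become quite a hack... BTW, if GenoMax's cat command didn't get you banned, you can drop the sleep.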

ADD REPLY • link 5.6 years ago by Carambakaracho ★ 3.3k

0

Hi, thanks for the script. I tried it but it reported errors:

line 9: $1: ambiguous redirect
line 10: protein_out: No such file or directory
sed: can't read protein_out_mod: No such file or directory

I also tried deleting the last 3 lines; it then reported no error but produced no output file. Could you kindly tell me if I made any mistakes with the script? Thank you!

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80

0

You have to paste it into a file, for example script.sh, and then execute the script:

bash script.sh idfile.txt

ADD REPLY • link 5.6 years ago by Carambakaracho ★ 3.3k
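The "ambiguous redirect" message is what bash prints when `$1` is empty in an unquoted redirection such as `done <$1`, for example when the loop is pasted into an interactive shell instead of being run as a script with an argument. A minimal sketch of a guard that fails early with a clearer message; `check_ids_file` is a hypothetical helper name, not part of EDirect, and `idfile.txt` is the example file name used above:

```shell
#!/usr/bin/env bash
# check_ids_file is a hypothetical helper: it returns non-zero unless it is
# given exactly one argument that names a readable, non-empty file.
check_ids_file() {
    if [ "$#" -ne 1 ]; then
        echo "usage: bash script.sh idfile.txt" >&2
        return 2
    fi
    if [ ! -r "$1" ] || [ ! -s "$1" ]; then
        echo "cannot read ID file (or it is empty): $1" >&2
        return 1
    fi
}

# Demonstration with a throwaway ID file.
printf 'XM_017496492.1\n' > idfile.txt
check_ids_file idfile.txt && echo "idfile.txt looks usable"
check_ids_file || echo "no argument given, refusing to run"
```

In the download script itself, the equivalent would be a line such as `check_ids_file "$@" || exit 1` placed before the `while` loop.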

0

Hi, thanks for your patience. I tried to submit the script to the background with "qsub -F script.sh idfile.txt" and it reports a similar error:

line 9: nucl_id: No such file or directory
line 10: protein_out: No such file or directory
sed: can't read protein_out_mod: No such file or directory

If I run the script without submitting it to the background, it also reports an error saying "too many replecates". I am not sure what mistake I made and would really appreciate it if you could point it out. Thanks and have a great weekend!

ADD REPLY • link 5.6 years ago by mxlsherry1992 ▴ 80

0

What do you mean by sending this to the background? qsub is, AFAIK, part of a job queuing system used to manage the load on bigger systems. Anyway, nucl_id doesn't seem to exist wherever you execute the script.

The second error seems to be issued by one of the elink scripts. Does it run for a while and then abort? What is the last entry in the file? Can you execute the last failing line on its own?