Question

How to get protein ID from gene ID (batch entrez)

0

10.4 years ago

alansoffan • 0

Hi

Can someone suggest how to get the protein ID from a gene ID (Batch Entrez)?

I have hundreds of gene names like AaeL_AAEL004207 with gene ID 5564359. We can get the protein IDs manually one by one, but with hundreds of them that is obviously not a good idea. Can anyone suggest a better way?

Thanks

gene • 7.7k views

ADD COMMENT • link updated 3.5 years ago by yuxia_sc • 0 • written 10.4 years ago by alansoffan • 0

0

Thanks a lot for the suggestions. I haven't tried that yet, but hopefully it will work.

ADD REPLY • link 10.4 years ago by alansoffan • 0

Ram · Accepted Answer · 2014-12-04

3

10.4 years ago

5heikki 11k

With Entrez Direct:

epost -db gene -id 5564359 | elink -target protein | efetch -format uid
157105044

You can include multiple gene IDs (at least 500) in the -id part, separated by commas. Here's a script:

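#!/bin/bash
# Map NCBI Gene IDs to linked protein UIDs with Entrez Direct.
# Input: a file with one Gene ID per line.

# Stop early if Entrez Direct is not installed.
if ! command -v epost > /dev/null
then
    echo "Entrez Direct not in \$PATH" >&2
    exit 1
fi

if [ -n "$1" ]
then
    # Post the IDs in batches of 500 per request.
    split -l 500 "$1" input.

    for f in input.*
    do
        # Join the IDs with commas (no trailing comma).
        ids=$(paste -sd, "$f")
        epost -db gene -id "$ids" | elink -target protein | efetch -format uid > "$f.output"
        # paste pairs lines by position, so this assumes exactly one
        # protein per gene ID, returned in input order.
        paste "$f" "$f.output" > "$f.result"
        rm "$f" "$f.output"
    done

    cat input.*.result > "$1.output"
    rm input.*.result

else
    printf 'Usage: sh convertGeneIDs listOfGeneIDs\nOutput: geneID\tproteinID\n'
fi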
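One thing to watch with the script above: paste matches gene IDs to protein UIDs purely by line position, so the columns only line up if every gene ID gives back exactly one protein, in the same order as the input. Here is a minimal sketch of a per-ID variant that does not rely on that. It is slower (one request chain per ID) and is only an illustration: lookup_proteins and map_gene_ids are hypothetical names, the "NA" placeholder is my own choice, and the pipeline inside is the same epost | elink | efetch chain as in the answer.

```shell
# lookup_proteins and map_gene_ids are hypothetical helper names.

# Print the protein UIDs linked to one gene ID, one per line.
# Same Entrez Direct pipeline as in the answer; needs network access.
lookup_proteins() {
    epost -db gene -id "$1" | elink -target protein | efetch -format uid
}

# Read gene IDs (one per line) from file $1 and print "geneID<TAB>proteinID"
# for every linked protein, or "geneID<TAB>NA" when nothing comes back.
# $2 optionally names another lookup command, e.g. a stub for offline testing.
map_gene_ids() {
    lookup=${2:-lookup_proteins}
    while IFS= read -r gene
    do
        [ -n "$gene" ] || continue      # skip blank lines
        # stdin is redirected so the lookup cannot swallow the rest of the ID list
        proteins=$("$lookup" "$gene" < /dev/null)
        if [ -z "$proteins" ]
        then
            printf '%s\tNA\n' "$gene"
        else
            printf '%s\n' "$proteins" | while IFS= read -r protein
            do
                printf '%s\t%s\n' "$gene" "$protein"
            done
        fi
    done < "$1"
}
```

Assuming the functions are sourced into the shell, something like map_gene_ids listOfGeneIDs > listOfGeneIDs.output would write one line per gene/protein pair.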

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 10.4 years ago by 5heikki 11k

0

I was puzzled by the

if [ -n "$1" ]

line, which turns out to mean "if non-empty string"

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 10.4 years ago by Nancy Ouyang ▴ 170

0

non-empty first argument ;)
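For anyone else puzzled by it, here is a small sketch of how the -n test behaves with and without an argument. check_arg is a hypothetical function made up for this illustration, and genes.txt is just an example string.

```shell
# check_arg is a hypothetical helper, used only to illustrate [ -n "$1" ].
check_arg() {
    if [ -n "$1" ]
    then
        echo "got: $1"
    else
        echo "no argument"
    fi
}

check_arg genes.txt    # prints "got: genes.txt"
check_arg              # prints "no argument"
check_arg ""           # prints "no argument" (an empty string counts as empty)
```

The quotes around $1 matter: without them, a missing argument leaves [ -n ] with a single operand, which evaluates as true.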

ADD REPLY • link 10.4 years ago by 5heikki 11k

0

Moi Heikki,

Thank you for writing this script! May I ask you more details about it?

Kiitos Paljon!

Best wishes, Xia

ADD REPLY • link 3.5 years ago by yuxia_sc • 0

0

This was 7 years ago, and I certainly wouldn't write it the same way now. Anyway, sure, ask away.

ADD REPLY • link 3.5 years ago by 5heikki 11k

0

Thank you very much, Heikki.

I have a large csv file containing protein IDs from 30 samples and counts for each protein ID of individual samples. I would like to use entrez direct to search each protein ID for the specific bacterial species. My supervisor mentioned that she adapted the script found online and gave it to me to use. Then, I found your script on Biostar. I am new to this field, so my questions may be very silly to you. Hope you won't mind.

exist=$(which epost): Shall I define the csv file here instead of using which epost?

for f in input.*: Shall I define the input file here?

Many thanks again, Heikki. If it is possible, may I contact you by email? My email address is x.yu2@leeds.ac.uk.

Best wishes, Xia

ADD REPLY • link 3.5 years ago by yuxia_sc • 0

1

exist=$(which epost) checks that the entrez tools have been installed (well epost, but here it's assumed that if epost is in $PATH, then so is everything else that the script needs)
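If you would rather not rely on that assumption, a minimal sketch of checking every tool the script calls could look like this. require_tools is a hypothetical helper name, and command -v is used here in place of which because it is a shell builtin.

```shell
# require_tools is a hypothetical helper: it succeeds only if every
# command named in its arguments can be found in $PATH.
require_tools() {
    for tool in "$@"
    do
        if ! command -v "$tool" > /dev/null 2>&1
        then
            echo "$tool not in \$PATH" >&2
            return 1
        fi
    done
}
```

Near the top of a script it could be called as require_tools epost elink efetch || exit 1.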

As for defining the input file: save the script to a file, e.g. convertGeneIds, make it executable with chmod +x convertGeneIds, and then run it as ./convertGeneIds inputFile

The script assumes that the input file is a list of ids, one on each line, no commas or anything like that
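A quick way to check that assumption before running anything is sketched below. is_id_list is a hypothetical helper name; it only looks for commas and whitespace and does not validate the IDs themselves.

```shell
# is_id_list is a hypothetical helper: it succeeds when no line of the
# file $1 contains a comma, space, or tab, i.e. the file looks like a
# plain one-ID-per-line list.
is_id_list() {
    ! grep -q '[,[:space:]]' "$1"
}
```

For example, is_id_list inputFile || echo "inputFile is not a plain ID list" would flag a csv file before it reaches epost.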

ADD REPLY • link 3.5 years ago by 5heikki 11k

0

Thank you very much for your detailed reply, Heikki. I understand better now. I ran the script, but it showed "command not found" for each ID.

The first column of our input file is a list of IDs, followed by sample names, delimited by commas, as shown below:

[screenshot: input file]

The numbers in the input files are counts for each ID in individual samples. The script used is:

[screenshot: script used]

Could you give me some suggestions to improve it? Many thanks again.

ADD REPLY • link 3.5 years ago by yuxia_sc • 0

1

You need to isolate just the IDs. The script expects its input to contain only IDs and nothing else. You can extract the IDs from column 1 like this:

cut -d "," -f1 yourfile > new_file

Then use new_file as the input to @5heikki's script.
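If the csv also has a header row (an assumption on my part, since the input file was only posted as a screenshot), the header would end up in the ID list too. A minimal sketch that skips it is below. extract_ids is a hypothetical helper name, and the rows in the demo are made-up placeholder values.

```shell
# extract_ids is a hypothetical helper: print column 1 of the
# comma-delimited file $1, skipping the first (header) line.
extract_ids() {
    tail -n +2 "$1" | cut -d "," -f1
}

# Demo on a throwaway file with made-up placeholder rows.
demo=$(mktemp)
printf 'id,sample1,sample2\n111,10,3\n222,0,7\n' > "$demo"
extract_ids "$demo"    # prints 111 and 222, one per line
rm -f "$demo"
```

With a real file this would be extract_ids yourfile > new_file. If there is no header row, the plain cut command is enough.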

Note: Don't post screenshots of code in comments. Always copy and paste the actual code.