How to get protein ID from gene ID (batch entrez)
10.0 years ago
alansoffan • 0

Hi

Can someone suggest how to get protein IDs from gene IDs (Batch Entrez)?

I have hundreds of gene names like AaeL_AAEL004207 with gene ID 5564359. We can look up the protein ID manually, one at a time, but with hundreds of genes that is obviously not a good idea. Can anyone suggest a better way?

Thanks

gene • 7.3k views
ADD COMMENT

Thanks a lot for the suggestions. I haven't tried that yet, but hopefully it will work.

ADD REPLY
10.0 years ago
5heikki 11k

With Entrez Direct:

epost -db gene -id 5564359 | elink -target protein | efetch -format uid
157105044

You can include multiple gene IDs (at least 500) in the -id part, separated by commas. Here's a script:

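#!/bin/bash
# Usage: ./convertGeneIDs listOfGeneIDs
# Input: one gene ID per line. Output: listOfGeneIDs.output (geneID<TAB>proteinID)

# Stop early if Entrez Direct is not installed
if ! command -v epost >/dev/null 2>&1
then
    echo "Entrez Direct not in \$PATH" >&2
    exit 1
fi

if [ -n "$1" ]
then
    # Split the input into chunks of 500 IDs: input.aa, input.ab, ...
    split -l 500 "$1" input.

    for f in input.*
    do
        # Join the IDs of one chunk with commas (no trailing comma)
        ids=$(paste -sd, "$f")
        epost -db gene -id "$ids" | elink -target protein | efetch -format uid > "$f.output"
        # NOTE: paste pairs lines by position, so this assumes exactly one
        # protein ID per gene ID, returned in input order
        paste "$f" "$f.output" > "$f.result"
        rm "$f" "$f.output"
    done

    cat input.*.result > "$1.output"
    rm input.*.result
else
    printf 'Usage: ./convertGeneIDs listOfGeneIDs\nOutput: geneID\tproteinID\n'
fi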
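
The batching step can also be done without temporary files. Below is a minimal sketch that groups a list of IDs into comma-separated batches with awk. batch_ids is a hypothetical helper name, the demo IDs are placeholders, and the batch size of 500 in the trailing comment is simply the figure mentioned above, not a documented limit. The epost pipeline is left as a comment because it needs network access and an Entrez Direct installation.

```shell
#!/bin/bash
# batch_ids FILE SIZE
# Print the IDs in FILE (one per line) as comma-separated batches,
# one batch of at most SIZE IDs per output line. Blank lines are skipped.
batch_ids() {
    awk -v n="$2" '
        NF {
            # comma inside a batch, newline between batches, nothing before the first ID
            printf "%s%s", (c % n ? "," : (c ? "\n" : "")), $1
            c++
        }
        END { if (c) print "" }
    ' "$1"
}

# Demo: five placeholder IDs in batches of two
demo=$(mktemp)
printf '%s\n' 101 102 103 104 105 > "$demo"
batch_ids "$demo" 2
rm -f "$demo"

# Each batch line could then be fed to the pipeline from the answer, e.g.:
#   batch_ids genes.txt 500 | while read -r ids; do
#       epost -db gene -id "$ids" | elink -target protein | efetch -format uid
#   done
```

The demo prints three lines: 101,102 then 103,104 then 105.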
ADD COMMENT

I was puzzled by the

if [ -n "$1" ]

line, which turns out to mean "if non-empty string"

ADD REPLY

non-empty first argument ;)
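
To see what the -n test does in practice, here is a small sketch. describe_arg is a hypothetical helper written only for illustration; inside a function "$1" is the function's first argument, which behaves the same way as the script's first argument does at top level.

```shell
#!/bin/bash
# describe_arg [ARG]
# Report whether the first argument is a non-empty string.
describe_arg() {
    if [ -n "$1" ]; then
        echo "non-empty"
    else
        echo "empty or missing"
    fi
}

describe_arg genes.txt   # an ordinary argument
describe_arg ""          # an explicitly empty argument
describe_arg             # no argument at all
```

The quotes around "$1" matter: without them, a missing argument leaves the test as [ -n ], which is true, so the usage message would never be shown.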

ADD REPLY

Hi Heikki,

Thank you for writing this script! May I ask you for more details about it?

Thank you very much!

Best wishes, Xia

ADD REPLY

This was 7 years ago; I certainly wouldn't write it the same way now. Anyway, sure, ask away.

ADD REPLY

Thank you very much, Heikki.

I have a large CSV file containing protein IDs from 30 samples, with the counts for each protein ID in the individual samples. I would like to use Entrez Direct to search each protein ID for the specific bacterial species. My supervisor mentioned that she adapted a script found online and gave it to me to use. Then I found your script on Biostar. I am new to this field, so my questions may seem very silly to you. I hope you won't mind.

exist=$(which epost): should I define the CSV file here instead of using which epost?

for f in input.*: should I define the input file here?

Many thanks again, Heikki. If it is possible, may I contact you by email? My email address is x.yu2@leeds.ac.uk.

Best wishes, Xia

ADD REPLY

exist=$(which epost) checks that the entrez tools have been installed (well epost, but here it's assumed that if epost is in $PATH, then so is everything else that the script needs)
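
A sketch of the same kind of check written with the shell builtin command -v, which succeeds only if the named program can be found. require_tool is a hypothetical helper name, and checking elink and efetch as well as epost is an extra step that the original script does not perform.

```shell
#!/bin/bash
# require_tool NAME
# Return 0 if NAME can be found on $PATH, otherwise print a message and return 1.
require_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1 not found in \$PATH" >&2
        return 1
    fi
}

# Check every Entrez Direct program the pipeline calls
for tool in epost elink efetch; do
    if require_tool "$tool"; then
        echo "$tool: ok"
    else
        echo "$tool: missing"
    fi
done
```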

As for defining the input file: you just save the script to a file, e.g. convertGeneIds, make it executable with chmod +x convertGeneIds, and then run it as ./convertGeneIds inputFile

The script assumes that the input file is a list of ids, one on each line, no commas or anything like that
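
A quick way to check that a file meets this expectation before running the script is sketched below. is_id_list is a hypothetical helper name, and the check assumes the IDs are purely numeric, like the gene ID 5564359 in the question; the second demo ID is a placeholder.

```shell
#!/bin/bash
# is_id_list FILE
# Return 0 if FILE is non-empty and every line consists only of digits.
is_id_list() {
    [ -s "$1" ] || return 1
    # grep -v selects lines that do NOT match; finding one means the file is invalid
    if grep -qvE '^[0-9]+$' "$1"; then
        return 1
    fi
    return 0
}

demo=$(mktemp)
printf '%s\n' 5564359 5564360 > "$demo"
is_id_list "$demo" && echo "looks like a plain ID list"
printf '%s\n' '5564359,12,7' > "$demo"
is_id_list "$demo" || echo "not a plain ID list"
rm -f "$demo"
```

The check also rejects files with Windows line endings, because the trailing carriage return is not a digit.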

ADD REPLY

Thank you very much for your detailed reply, Heikki. I understand better now. I ran the script, but it reported "command not found" for each ID.

The first column of our input file is a list of IDs, followed by the sample names, delimited by commas, as shown below:

[Screenshot: input file]

The numbers in the input files are counts for each ID in individual samples. The script used is:

[Screenshot: script used]

Could you give me some suggestions to improve it? Many thanks again.

ADD REPLY

You need to isolate just the IDs. The script expects only IDs to be present, nothing else. You can do the following to extract the IDs from column 1:

cut -d "," -f1 yourfile > new_file

Then use the new_file with @5heikki's script as input.
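
A slightly more defensive sketch of the same extraction follows. The input file was only posted as a screenshot, so whether it has a header row or Windows line endings is unknown; both are handled here as assumptions. first_column is a hypothetical helper name, and the demo CSV content is made up.

```shell
#!/bin/bash
# first_column FILE [SKIP]
# Print the first comma-separated field of each line in FILE.
# SKIP is the number of leading header lines to drop (default 0).
# Carriage returns are stripped in case the file was saved on Windows.
first_column() {
    tail -n "+$(( ${2:-0} + 1 ))" "$1" | tr -d '\r' | cut -d "," -f1
}

# Demo with a made-up CSV that has one header line
demo=$(mktemp)
printf '%s\n' 'id,sampleA,sampleB' '5564359,12,7' '1234567,0,3' > "$demo"
first_column "$demo" 1
rm -f "$demo"
```

The demo prints 5564359 and 1234567, one per line. With a real file the result can be redirected as before, e.g. first_column yourfile 1 > new_file.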

Note: Don't post screenshots of code in comments. Always copy and paste the actual code.

ADD REPLY

Thank you very much for your suggestions!

ADD REPLY
