Question

Extracting Gene Ids From Protein Ids

1

12.9 years ago

Assa Yeroslaviz ★ 1.9k

Hi,

I have a tab-delimited table of protein IDs that looks like this:

45    FBpp0070037    
46    FBpp0070039;FBpp0070040    
47    FBpp0070041;FBpp0070042;FBpp0070043    
48    FBpp0070044;FBpp0110571    
...

For each of these protein IDs I would like to extract the gene ID (FBgn...) into a third column. The output table should look like this:

45    FBpp0070037                          FBgn001234  
46    FBpp0070039;FBpp0070040              FBgn00094432;FBgn002345   
47    FBpp0070041;FBpp0070042;FBpp0070043  FBgn0001936;FBgn000102;FBgn004527   
48    FBpp0070044;FBpp0110571              FBgn0097234;FBgn00183   
...

I was thinking of using biomaRt, but I couldn't find a way of automating it for all the protein IDs in a line.

I would appreciate your ideas.

Thanks, A.

r conversion biomart • 3.5k views

ADD COMMENT • link updated 12.9 years ago by Rm 8.3k • written 12.9 years ago by Assa Yeroslaviz ★ 1.9k

score 5 · Answer 1 · 2012-01-24

5

12.9 years ago

Rm 8.3k

FlyBase provides a precomputed file that maps FBgn <=> FBtr <=> FBpp IDs (fbgn_fbtr_fbpp_*.tsv).

input_file.txt:

FBpp0070037    
FBpp0070039;FBpp0070040    
FBpp0070041;FBpp0070042;FBpp0070043    
FBpp0070044;FBpp0110571

while read -r LINE; do
    # Note: only the first protein ID on each line is looked up.
    fbpp="$(echo "$LINE" | cut -d";" -f1)"
    fbgn="$(grep -m1 "$fbpp" fbgn_fbtr_fbpp_fb_2011_10.tsv | cut -f1)"
    printf '%s\t%s\n' "$LINE" "$fbgn"
done < input_file.txt > out_fbpp2fbgn.txt

out_fbpp2fbgn.txt:

FBpp0070037     FBgn0010215
FBpp0070039;FBpp0070040 FBgn0052230
FBpp0070041;FBpp0070042;FBpp0070043     FBgn0000258
FBpp0070044;FBpp0110571 FBgn0053217

ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_fbtr_fbpp_fb_2011_10.tsv.gz

ADD COMMENT • link 12.9 years ago by Rm 8.3k

0

The file is good, but not exactly what I was looking for. Thanks.

ADD REPLY • link 12.9 years ago by Assa Yeroslaviz ★ 1.9k

0

Can you be more specific?

ADD REPLY • link 12.9 years ago by Rm 8.3k

0

My problem is not getting the data from biomaRt, but getting it while keeping the structure of the table. If I submit the column as one ID per line, it will then be difficult to match the gene IDs back to the right protein IDs.

ADD REPLY • link 12.9 years ago by Assa Yeroslaviz ★ 1.9k
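
One way to address the concern above is to carry the row key along with every protein ID when flattening, and to regroup by that key after the lookup. The following is only a minimal sketch: the function names, the "NA" placeholder and the example IDs are made up for illustration, and `protein_to_gene` stands in for whatever lookup result you obtain (from biomaRt or a mapping file).

```python
from collections import defaultdict

def flatten(rows):
    """Yield (row_key, protein_id) pairs, one pair per protein ID."""
    for key, id_field in rows:
        for pid in id_field.split(";"):
            pid = pid.strip()
            if pid:
                yield key, pid

def regroup(pairs, protein_to_gene, missing="NA"):
    """Collect gene IDs per row key, in the original protein ID order."""
    grouped = defaultdict(list)
    for key, pid in pairs:
        grouped[key].append(protein_to_gene.get(pid, missing))
    return {key: ";".join(genes) for key, genes in grouped.items()}

# Hypothetical rows and lookup results, for illustration only.
rows = [("45", "FBpp0000001"), ("46", "FBpp0000002;FBpp0000003")]
protein_to_gene = {
    "FBpp0000001": "FBgn0000001",
    "FBpp0000002": "FBgn0000002",
    "FBpp0000003": "FBgn0000002",
}

pairs = list(flatten(rows))          # one ID per entry, ready for a batch lookup
genes_by_row = regroup(pairs, protein_to_gene)
for key, id_field in rows:
    print(key, id_field, genes_by_row.get(key, ""), sep="\t")
```

Because each flattened entry keeps its row key, the one-ID-per-line list can be submitted in a single query and the results still fall back into the right row.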

0

See the edited answer, which now includes your request.

ADD REPLY • link 12.9 years ago by Rm 8.3k

score 1 · Answer 2 · 2012-01-31

Hi,

I'd suggest using a Python script (see below) to do the parsing. The code works as you described and removes duplicate IDs within the same line.

Hope this can help you.

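import sys

inFile = sys.argv[1]  # path to the tab-delimited table

for line in open(inFile):
  fields = line.split()
  if len(fields) < 2:
    continue  # skip empty or incomplete lines

  # Split the second column into single IDs and separate gene IDs from the rest;
  # using sets removes duplicates within the line.
  ids = [singleID.strip() for singleID in fields[1].split(";")]
  genes = set(singleID for singleID in ids if singleID.startswith("FBgn"))
  others = set(ids) - genes

  print("%s\t%s\t%s" % (fields[0], ";".join(sorted(others)), ";".join(sorted(genes))))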

score 1 · Answer 3 · 2012-01-31

You can download the index file that Rm suggested. It has three columns, where the first column is the gene ID and the third column is the protein ID. Then use a Python script that reads the index file and transforms your protein ID list. So something like this:

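import sys

# Build a protein ID -> gene ID lookup from the index file
# (column 1 = gene ID, column 3 = protein ID).
index = {}
with open(sys.argv[1]) as indexFile:
    for line in indexFile:
        if not line.startswith("#") and line.strip():
            data = line.strip().split('\t')
            if len(data) > 2:
                index[data[2]] = data[0]

# Append the gene IDs for every protein ID on each line.
with open(sys.argv[2]) as pidFile:
    for line in pidFile:
        line = line.strip()
        if not line:
            continue
        # The protein IDs are taken from the last tab-separated column.
        data = line.split('\t')[-1].split(';')
        output = ';'.join(index[item] for item in data)
        print(line + "\t" + output)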

Save as myScript.py. Use by: python myScript.py indexFile proteinIDFile
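
Since the FlyBase file is distributed gzipped, the index can also be read without decompressing it first. This is only a sketch under a few assumptions: `load_index` and `map_ids` are hypothetical helper names, the column layout is the one described above (gene ID first, protein ID third), and protein IDs missing from the index are reported as "NA" instead of raising an error.

```python
import gzip

def load_index(path):
    """Return a dict mapping protein ID -> gene ID from a plain or gzipped index file."""
    opener = gzip.open if path.endswith(".gz") else open
    index = {}
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            # Rows without a protein ID (third column) are ignored.
            if len(cols) > 2 and cols[2]:
                index[cols[2]] = cols[0]
    return index

def map_ids(id_field, index, missing="NA"):
    """Translate a ';'-separated list of protein IDs into gene IDs, keeping the order."""
    return ";".join(index.get(pid, missing) for pid in id_field.split(";") if pid)
```

For example, `index = load_index("fbgn_fbtr_fbpp_fb_2011_10.tsv.gz")` followed by `map_ids("FBpp0070039;FBpp0070040", index)` would give the gene IDs for that line, joined by semicolons.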