Question

Command Or Script To Generate An Annotation File For Blast2Go

11.7 years ago

lzsph ▴ 70

Dear all,

We have RNA-Seq data from a plant without a reference genome, so we assembled its transcriptome de novo using Trinity. We now have an annotation file for this transcriptome, and I want to generate a GO (Gene Ontology) functional classification using Blast2GO. However, re-mapping my BLAST XML results in Blast2GO is very time-consuming (I tried it with the ~200,000 transcripts in my transcriptome).

An alternative way to generate the GO functional classification is to import an annotation file (with the suffix .annot, according to the Blast2GO manual). The .annot file looks like the example below (you can also get this example .annot file at http://www.blast2go.com/b2glaunch/resources, in b2g_example_files.zip). P.S. If you can't see this picture from Flickr, I also put it on GitHub.

[Image: example .annot file]

But our annotation file was not generated by Blast2GO; it is a CSV file like the one below (a little messy):

X01_query_id, X06_hit_title, X07_molecular_function, X08_biological_process, X09_cellular_component ##header of the CSV file
comp1000113_c0_seq1,  Cc-nbs resistance protein [Medicago truncatula], GO:0043531 // ADP binding;GO:0005524 // ATP binding;GO:0017111 // nucleoside-triphosphatase activity, GO:0006952 // defense response,
comp10001_c0_seq1, Pistil-specific extensin-like protein [Medicago truncatula], , , 
comp1000255_c0_seq1, F-box protein [Medicago truncatula], , , 
comp1000736_c0_seq1, Alpha-L-arabinofuranosidase [Medicago truncatula],GO:0046556 // alpha-N-arabinofuranosidase activity, GO:0046373 // L-arabinose metabolic process,  
comp1000860_c0_seq1, Protein kinase [Medicago truncatula], GO:0005524 // ATP binding;GO:0004674 // protein serine/threonine kinase activity, ,

Could you please help me with a command or script that generates a .annot file like the one below from this CSV? Lines without GO IDs should be ignored.

[Image: desired .annot output; also available on GitHub]

I look forward to hearing from all of you soon.

Thank you and best regards,

lzsph

go • 7.0k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 11.7 years ago by lzsph ▴ 70

score 2 · Answer 1 · 2013-03-05

11.7 years ago

Whetting ★ 1.6k

Using Python, this should work (if I understood what you wanted):

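# Python 3
import re

# "w" so that re-running the script does not append duplicate lines
with open("input.csv") as f, open("test.annot", "w") as out:
    for line in f:
        fields = line.rstrip().split(",")
        # GO IDs from the three GO columns; the header and rows without GO terms yield none
        go_ids = re.findall(r"GO:\d{7}", ",".join(fields[2:]))
        for i, go_id in enumerate(go_ids):
            if i == 0:
                # protein name (without the "[species]" tag) only on the first line of a set
                print(fields[0], go_id, fields[1].rsplit(" [")[0], file=out)
            else:
                print(fields[0], go_id, file=out)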

This gives:

comp1000113_c0_seq1 GO:0043531  Cc-nbs resistance protein
comp1000113_c0_seq1 GO:0005524
comp1000113_c0_seq1 GO:0017111
comp1000113_c0_seq1 GO:0006952
comp1000736_c0_seq1 GO:0046556  Alpha-L-arabinofuranosidase
comp1000736_c0_seq1 GO:0046373
comp1000860_c0_seq1 GO:0005524  Protein kinase
comp1000860_c0_seq1 GO:0004674
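
Note that splitting each line on "," assumes no hit title contains a comma. A slightly more defensive sketch of the parsing step, using the standard csv module, could look like this. It assumes that titles containing commas are quoted in the CSV, which the sample does not show; the function name read_go_records is made up for illustration.

```python
import csv
import re

GO_ID = re.compile(r"GO:\d{7}")

def read_go_records(lines):
    """Yield (query_id, protein_name, go_ids) for every row with at least one GO ID."""
    # skipinitialspace drops the blank that follows each comma in the sample file
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 3:
            continue  # blank or malformed row
        go_ids = GO_ID.findall(",".join(row[2:]))
        if not go_ids:
            continue  # header row and hits without GO terms
        name = row[1].split(" [")[0].strip()  # drop the trailing "[species]" tag
        yield row[0].strip(), name, go_ids

sample = [
    "X01_query_id, X06_hit_title, X07_molecular_function, X08_biological_process, X09_cellular_component",
    "comp10001_c0_seq1, Pistil-specific extensin-like protein [Medicago truncatula], , , ",
    "comp1000860_c0_seq1, Protein kinase [Medicago truncatula], GO:0005524 // ATP binding;GO:0004674 // protein serine/threonine kinase activity, ,",
]
for record in read_go_records(sample):
    print(record)
```

For a real file, passing the open file object (opened with newline="") instead of the sample list should work the same way.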

ADD COMMENT • link 11.7 years ago by Whetting ★ 1.6k

Hi Whetting,

Awesome. By the way, how can I keep only one protein name per set, like this?

comp1000113_c0_seq1 GO:0043531 Cc-nbs resistance protein

comp1000113_c0_seq1 GO:0005524

comp1000113_c0_seq1 GO:0017111

comp1000113_c0_seq1 GO:0006952

Thank you very much!

Regards,

Lzsph

ADD REPLY • link 11.7 years ago by lzsph ▴ 70

I edited the code; this should give you what you want.
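
The idea, as a standalone sketch: write the protein name only on the first line of each set. The helper name annot_lines is hypothetical, and the tab separator is an assumption (the output shown in this thread is space-separated), so check it against the Blast2GO example files before importing.

```python
def annot_lines(query_id, protein_name, go_ids):
    """Return one tab-separated .annot line per GO ID, naming the protein only once."""
    lines = []
    for i, go_id in enumerate(go_ids):
        fields = [query_id, go_id]
        if i == 0:
            fields.append(protein_name)  # description only on the first line of the set
        lines.append("\t".join(fields))
    return lines

for line in annot_lines("comp1000113_c0_seq1", "Cc-nbs resistance protein",
                        ["GO:0043531", "GO:0005524", "GO:0017111", "GO:0006952"]):
    print(line)
```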

ADD REPLY • link 11.7 years ago by Whetting ★ 1.6k

Hi Whetting, thanks again. It's perfect!!!

ADD REPLY • link 11.7 years ago by lzsph ▴ 70