Command Or Script To Generate An Annotation File For Blast2Go
11.7 years ago
lzsph ▴ 70

Dear all,

We have RNA-Seq data from a plant without a reference genome, so we assembled its transcriptome de novo using Trinity. We now have an annotation file for this transcriptome, and I want to generate a GO (Gene Ontology) functional classification using Blast2GO. However, re-mapping my BLAST XML results in Blast2GO is time-consuming (I tried it with the ~200,000 transcripts in my transcriptome).

An alternative way to generate the GO functional classification is to import an annotation file (with the suffix .annot, according to the Blast2GO manual). The .annot file looks like the example below (you can also get this example .annot file at http://www.blast2go.com/b2glaunch/resources, in b2g_example_files.zip). P.S. If you can't see this picture from Flickr, I also put it on GitHub.

[image: example .annot file]

But our annotation file was not generated by Blast2GO; it is a CSV file like the one below (a little messy):

X01_query_id, X06_hit_title, X07_molecular_function, X08_biological_process, X09_cellular_component ##header of the CSV file
comp1000113_c0_seq1,  Cc-nbs resistance protein [Medicago truncatula], GO:0043531 // ADP binding;GO:0005524 // ATP binding;GO:0017111 // nucleoside-triphosphatase activity, GO:0006952 // defense response,
comp10001_c0_seq1, Pistil-specific extensin-like protein [Medicago truncatula], , , 
comp1000255_c0_seq1, F-box protein [Medicago truncatula], , , 
comp1000736_c0_seq1, Alpha-L-arabinofuranosidase [Medicago truncatula],GO:0046556 // alpha-N-arabinofuranosidase activity, GO:0046373 // L-arabinose metabolic process,  
comp1000860_c0_seq1, Protein kinase [Medicago truncatula], GO:0005524 // ATP binding;GO:0004674 // protein serine/threonine kinase activity, ,

Could you help me with a command or script that generates a .annot file like the one below? Lines without GO IDs should be ignored.

[image: desired .annot output]

I look forward to hearing from all of you soon.

Thank you and best regards,

lzsph

11.7 years ago
Whetting ★ 1.6k

Using Python, this should work (if I understood what you wanted):

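import re

# Python 3 version: one output line per GO ID; only the first line
# for each transcript carries the protein name.
with open("input.csv") as f, open("test.annot", "w") as out:
    for line in f:
        fields = line.rstrip().split(",")
        # GO IDs sit in the columns after the query ID and hit title.
        go_ids = re.findall(r"GO:\d{7}", ",".join(fields[2:]))
        for i, go_id in enumerate(go_ids):
            if i == 0:
                # Drop the trailing " [species]" part of the hit title.
                name = fields[1].rsplit(" [")[0].strip()
                print(fields[0], go_id, name, file=out)
            else:
                print(fields[0], go_id, file=out)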

This gives:

comp1000113_c0_seq1 GO:0043531 Cc-nbs resistance protein
comp1000113_c0_seq1 GO:0005524
comp1000113_c0_seq1 GO:0017111
comp1000113_c0_seq1 GO:0006952
comp1000736_c0_seq1 GO:0046556 Alpha-L-arabinofuranosidase
comp1000736_c0_seq1 GO:0046373
comp1000860_c0_seq1 GO:0005524 Protein kinase
comp1000860_c0_seq1 GO:0004674
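
The same idea can also be wrapped in a small function, which makes it easy to check against a few rows before running it on the whole file. This is only a sketch: the name `csv_to_annot` is made up for illustration, and the tab-separated columns are an assumption about what Blast2GO expects, so compare the result with the example .annot file mentioned in the question before importing.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")


def csv_to_annot(lines):
    """Yield .annot-style rows from CSV lines (hypothetical helper).

    Each GO ID becomes one row; only the first row for a transcript
    carries the hit title, with the trailing " [species]" removed.
    """
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 3:
            continue  # too few columns to hold any GO terms
        query_id = fields[0].strip()
        title = fields[1].split(" [")[0].strip()
        go_ids = GO_ID.findall(",".join(fields[2:]))
        for i, go_id in enumerate(go_ids):
            # Tab separators are an assumption, not taken from the thread.
            if i == 0:
                yield f"{query_id}\t{go_id}\t{title}"
            else:
                yield f"{query_id}\t{go_id}"


# Rows copied from the sample CSV in the question.
sample = [
    "X01_query_id, X06_hit_title, X07_molecular_function, X08_biological_process, X09_cellular_component",
    "comp10001_c0_seq1, Pistil-specific extensin-like protein [Medicago truncatula], , , ",
    "comp1000860_c0_seq1, Protein kinase [Medicago truncatula], GO:0005524 // ATP binding;GO:0004674 // protein serine/threonine kinase activity, ,",
]
rows = list(csv_to_annot(sample))
for row in rows:
    print(row)
```

To convert a real file, pass the open file object to `csv_to_annot` and write each yielded row followed by a newline. Note that a hit title containing a comma would shift the columns, because the fields in the sample file are not quoted.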

Hi Whetting,

Awesome. By the way, how can I keep only one protein name in each set, like this?

comp1000113_c0_seq1 GO:0043531 Cc-nbs resistance protein
comp1000113_c0_seq1 GO:0005524
comp1000113_c0_seq1 GO:0017111
comp1000113_c0_seq1 GO:0006952

Thank you very much!

Regards,

Lzsph


I edited the code; this should give you what you want.


Hi Whetting, thanks again. It's perfect!!!
