Question

How to correctly format WEGO's input with Perl

1

Entering edit mode

8.2 years ago

biomonte ▴ 220

Hello! ~ I did a de-novo transcriptome assembly using trinity and then annotated transcripts using trinotate. I now want to take a look at the distribution of gene ontology (GO) categories among transcripts. I wanna use WEGO for this (although software suggestions are welcome!). I'm working / struggling with Perl to correctly format my input file for WEGO.

I have a three-column tab delimited file like this:

TRINITY_DN150378_c0_g6_i1   GO:0005905^cellular_component^coated pit`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0006897^biological_process^endocytosis   GO:0005515^molecular_function^protein binding`GO:0005509^molecular_function^calcium ion binding
TRINITY_DN151939_c1_g1_i4   GO:0016021^cellular_component^integral component of membrane`GO:0016020^cellular_component^membrane`GO:0005634^cellular_component^nucleus`GO:0001558^biological_process^regulation of cell growth`GO:0030856^biological_process^regulation of epithelial cell differentiation   .

And I want my file to look like this:

TRINITY_DN150378_c0_g6_i1   GO:0005905 GO:0016021 GO:0005509 GO:0006897 GO:0005515 GO:0005509
TRINITY_DN151939_c1_g1_i4   GO:0016021 GO:0016020 GO:0005634 GO:0001558 GO:0030856

This is what I wrote so far:

#! /usr/bin/perl -w

use strict;

my @array;
my $input_file = $ARGV[0] or die ;  
open (IN, $input_file) or die "Couldn't open file $input_file: $!\n";

while (<IN>) {      
    chomp($_);
    my @column = split /\t+/, $_;
    my @GO1 = split /`/ , $column[1];
    foreach my $GO_name (@GO1) {
        my @GO_id = split /\^/ , $GO_name;
        push @array, "$GO_id[0]" if ($GO_id[0] !~ m/^\./);  
    }   
}
foreach my $line ( @array ) {
    print $line, "\n";
}

close IN;

This produces a list of GO categories like so:

GO:0005905
GO:0016021
GO:0005509
GO:0006897
GO:0016021
GO:0016020
GO:0005634
GO:0001558

But this just includes the first of the two GO's columns and does not match transcript ID.

I would really appreciate if someone can help to complete this script (or to write it in a different way).

Thanks in advance! ☺

RNA-Seq perl wego gene ontology trinotate • 2.4k views

ADD COMMENT • link 8.2 years ago by biomonte ▴ 220

score 2 · Answer 1 · 2016-09-30

2

Entering edit mode

8.2 years ago

Israel Barrantes ▴ 790

I adapted the suggestions given here:

http://seqanswers.com/forums/showthread.php?t=20195

for a script I made to pass the ANNOT output of blast2go, to WEGO. Although be aware that the WEGO server has not updated its GO database since 2009, If I recall correctly.

ADD COMMENT • link 8.2 years ago by Israel Barrantes ▴ 790