How to correctly format WEGO's input with Perl
1
1
Entering edit mode
8.2 years ago
biomonte ▴ 220

Hello! ~ I did a de-novo transcriptome assembly using trinity and then annotated transcripts using trinotate. I now want to take a look at the distribution of gene ontology (GO) categories among transcripts. I wanna use WEGO for this (although software suggestions are welcome!). I'm working / struggling with Perl to correctly format my input file for WEGO.

I have a three-column tab delimited file like this:

TRINITY_DN150378_c0_g6_i1   GO:0005905^cellular_component^coated pit`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0006897^biological_process^endocytosis   GO:0005515^molecular_function^protein binding`GO:0005509^molecular_function^calcium ion binding
TRINITY_DN151939_c1_g1_i4   GO:0016021^cellular_component^integral component of membrane`GO:0016020^cellular_component^membrane`GO:0005634^cellular_component^nucleus`GO:0001558^biological_process^regulation of cell growth`GO:0030856^biological_process^regulation of epithelial cell differentiation   .

And I want my file to look like this:

TRINITY_DN150378_c0_g6_i1   GO:0005905 GO:0016021 GO:0005509 GO:0006897 GO:0005515 GO:0005509
TRINITY_DN151939_c1_g1_i4   GO:0016021 GO:0016020 GO:0005634 GO:0001558 GO:0030856

This is what I wrote so far:

#! /usr/bin/perl -w

use strict;

my @array;
my $input_file = $ARGV[0] or die ;  
open (IN, $input_file) or die "Couldn't open file $input_file: $!\n";

while (<IN>) {      
    chomp($_);
    my @column = split /\t+/, $_;
    my @GO1 = split /`/ , $column[1];
    foreach my $GO_name (@GO1) {
        my @GO_id = split /\^/ , $GO_name;
        push @array, "$GO_id[0]" if ($GO_id[0] !~ m/^\./);  
    }   
}
foreach my $line ( @array ) {
    print $line, "\n";
}

close IN;

This produces a list of GO categories like so:

GO:0005905
GO:0016021
GO:0005509
GO:0006897
GO:0016021
GO:0016020
GO:0005634
GO:0001558

But this just includes the first of the two GO's columns and does not match transcript ID.

I would really appreciate if someone can help to complete this script (or to write it in a different way).

Thanks in advance! ☺

RNA-Seq perl wego gene ontology trinotate • 2.4k views
ADD COMMENT
2
Entering edit mode
8.2 years ago

I adapted the suggestions given here:

http://seqanswers.com/forums/showthread.php?t=20195

for a script I made to pass the ANNOT output of blast2go, to WEGO. Although be aware that the WEGO server has not updated its GO database since 2009, If I recall correctly.

ADD COMMENT

Login before adding your answer.

Traffic: 1692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6