Hello! ~ I did a de-novo transcriptome assembly using trinity and then annotated transcripts using trinotate. I now want to take a look at the distribution of gene ontology (GO) categories among transcripts. I wanna use WEGO for this (although software suggestions are welcome!). I'm working / struggling with Perl to correctly format my input file for WEGO.
I have a three-column tab delimited file like this:
TRINITY_DN150378_c0_g6_i1 GO:0005905^cellular_component^coated pit`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0006897^biological_process^endocytosis GO:0005515^molecular_function^protein binding`GO:0005509^molecular_function^calcium ion binding
TRINITY_DN151939_c1_g1_i4 GO:0016021^cellular_component^integral component of membrane`GO:0016020^cellular_component^membrane`GO:0005634^cellular_component^nucleus`GO:0001558^biological_process^regulation of cell growth`GO:0030856^biological_process^regulation of epithelial cell differentiation .
And I want my file to look like this:
TRINITY_DN150378_c0_g6_i1 GO:0005905 GO:0016021 GO:0005509 GO:0006897 GO:0005515 GO:0005509
TRINITY_DN151939_c1_g1_i4 GO:0016021 GO:0016020 GO:0005634 GO:0001558 GO:0030856
This is what I wrote so far:
#! /usr/bin/perl -w
use strict;
my @array;
my $input_file = $ARGV[0] or die ;
open (IN, $input_file) or die "Couldn't open file $input_file: $!\n";
while (<IN>) {
chomp($_);
my @column = split /\t+/, $_;
my @GO1 = split /`/ , $column[1];
foreach my $GO_name (@GO1) {
my @GO_id = split /\^/ , $GO_name;
push @array, "$GO_id[0]" if ($GO_id[0] !~ m/^\./);
}
}
foreach my $line ( @array ) {
print $line, "\n";
}
close IN;
This produces a list of GO categories like so:
GO:0005905
GO:0016021
GO:0005509
GO:0006897
GO:0016021
GO:0016020
GO:0005634
GO:0001558
But this just includes the first of the two GO's columns and does not match transcript ID.
I would really appreciate if someone can help to complete this script (or to write it in a different way).
Thanks in advance! ☺