Parse GFF to infer numbers of all exons, and just coding exons per gene
9.1 years ago
Anand Rao ▴ 640

Are there any tools that can parse GFF and report the numbers of all exons, as well as the numbers of just the coding exons, for each gene in the GFF for that species?

I am working on multiple plant species, not human or even mammalian.

If such tools exist, could you point me to the relevant function and give some example syntax, please? Thanks.

coding GFF exon • 2.2k views
9.1 years ago
jason ▴ 160

Assuming well-behaved GFF3, this Perl code gets you there. It counts CDS features per Parent attribute, which in most GFF3 files is the transcript ID.

$ perl count_CDS.pl annotation.gff3

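#!/usr/bin/env perl
# count_CDS.pl: count CDS features per Parent ID in a GFF3 file
use strict;
use warnings;

my %genes;
while (<>) {
    chomp;
    next if /^#/ || /^\s*$/;    # skip comments, directives and blank lines
    my @row = split /\t/;
    next unless @row >= 9 && $row[2] eq 'CDS';
    # column 9 holds key=value attributes separated by ';'
    my %group = map { split /=/, $_, 2 } grep { /=/ } split /;/, $row[8];
    next unless defined $group{Parent};
    $genes{ $group{Parent} }++;
}

# sorted from the most CDS features to the fewest, though you could just sort by ID too
for my $gene ( sort { $genes{$b} <=> $genes{$a} } keys %genes ) {
    print join( "\t", $gene, $genes{$gene} ), "\n";
}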

Thanks, that works very well to report CDS counts for each PACid. All 48 GFFs I parsed behaved OK, and your script was super fast. The script reports counts only for coding exons (CDS), not for all exons, right? Even so, I think the parsed output should be enough for me to test my hypothesis. Thanks again!


Yes. I leave it as an exercise to the reader to replace the 'CDS' part with 'exon' above, or to modify the code to track both CDS and exon ...
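For anyone who wants that exercise spelled out, here is a minimal sketch of one way to track both feature types in a single pass. It is not the original poster's script: the file name `count_features.pl`, the sub name `count_features`, and the output column headers are made up for illustration. It assumes GFF3 with nine tab-separated columns, a `Parent` attribute on every exon and CDS line, and one CDS line per coding exon segment. Counts are keyed by Parent (usually the transcript), so getting per-gene numbers would need an extra step mapping each mRNA to its gene.

```perl
#!/usr/bin/env perl
# count_features.pl (hypothetical name): exon and CDS counts per Parent ID in GFF3
use strict;
use warnings;

# Read GFF3 lines from a filehandle; return { parent_id => { exon => n, CDS => n } }.
sub count_features {
    my ($fh) = @_;
    my %counts;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^#/ || $line =~ /^\s*$/;    # skip comments and blank lines
        my @row = split /\t/, $line;
        next unless @row >= 9;
        my $type = $row[2];
        next unless $type eq 'exon' || $type eq 'CDS';
        # Column 9 holds key=value attributes separated by ';'
        my %attr = map { split /=/, $_, 2 } grep { /=/ } split /;/, $row[8];
        next unless defined $attr{Parent};
        # A feature shared by several transcripts lists its parents comma-separated
        $counts{$_}{$type}++ for split /,/, $attr{Parent};
    }
    return \%counts;
}

# Run as a script only when a file name is given:
#   perl count_features.pl annotation.gff3
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $counts = count_features($fh);
    close $fh;
    print join( "\t", qw(parent exons CDS) ), "\n";
    for my $id ( sort keys %$counts ) {
        # Missing feature types are reported as 0 (e.g. non-coding transcripts)
        print join( "\t", $id, $counts->{$id}{exon} // 0, $counts->{$id}{CDS} // 0 ), "\n";
    }
}
```

The difference between the two columns gives the number of exons that are entirely untranslated, under the one-CDS-line-per-coding-exon assumption above.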
