Has anyone succeeded in converting Genscan and Fgenesh output to GFF or GTF? I have found a few parsers on the net, but none of them works. If you have a parser, please share it.
Have you tried BioPerl's Bio::Tools::Genscan and Bio::Tools::Fgenesh parsers? Both implement Bio::SeqAnalysisParserI, so you can obtain SeqFeatureI objects from them. In combination with Bio::Tools::GFF, you can then write GFF2 or GTF.
For instance, you can produce some version of GFF from fgenesh output with this script:
#!/usr/bin/env perl
# PURPOSE: parse fgenesh output into gff
# USAGE: fgenesh fish somefish.dna | fgenesh2gff > somefish.dna.fgenesh.gff
use strict;
use warnings;
use Bio::Tools::Fgenesh; # to parse output into features
use Bio::Tools::GFF;     # to write the features as GFF

# Remaining options should name files to process, but if none, process
# standard input:
@ARGV = ('-') unless @ARGV;

my $fgenesh    = Bio::Tools::Fgenesh->new(-fh => \*ARGV);
my $featureout = Bio::Tools::GFF->new(-gff_version => 2);
my $IDNUM      = 0;

while (my $gene = $fgenesh->next_prediction()) {
    # Build a unique gene ID such as LanFP_DNA34_fgenesh_1
    my $ID = $gene->seq_id . "_fgenesh_" . ++$IDNUM;
    $gene->add_tag_value('ID', $ID);
    # Write each sub-feature (exons, poly-A site) with a Parent tag
    foreach ($gene->features) {
        $_->add_tag_value('Parent', $ID);
        $_->seq_id($gene->seq_id);
        $featureout->write_feature($_);
    }
}
$fgenesh->close();
exit 0;
... which will give you output like:
LanFP_DNA34 Fgenesh Poly_A_site 1224 1224 1.26 - . Parent LanFP_DNA34_fgenesh_1 ; score "1.26"
LanFP_DNA34 Fgenesh TerminalExon 1844 2024 26.02 - 2 Parent LanFP_DNA34_fgenesh_1 ; score "26.02"
LanFP_DNA34 Fgenesh InternalExon 2492 2622 20.19 - 0 Parent LanFP_DNA34_fgenesh_1 ; score "20.19"
LanFP_DNA34 Fgenesh InternalExon 3243 3342 25.15 - 2 Parent LanFP_DNA34_fgenesh_1 ; score "25.15"
LanFP_DNA34 Fgenesh InternalExon 3517 3668 18.92 - 0 Parent LanFP_DNA34_fgenesh_1 ; score "18.92"
LanFP_DNA34 Fgenesh InternalExon 4184 4276 11.42 - 0 Parent LanFP_DNA34_fgenesh_1 ; score "11.42"
LanFP_DNA34 Fgenesh InternalExon 4569 4694 14.86 - 0 Parent LanFP_DNA34_fgenesh_1 ; score "14.86"
LanFP_DNA34 Fgenesh InitialExon 5384 5566 2.09 - 0 Parent LanFP_DNA34_fgenesh_1 ; score "2.09"
.... when run on input like:
FGENESH 2.4 Prediction of potential genes in Fish genomic DNA
Time : Mon Jul 10 14:18:02 2006
Seq name: LanFP_DNA34 Clipped to 31-5694
Length of sequence: 5663
Number of predicted genes 1 in +chain 0 in -chain 1
Number of predicted exons 7 in +chain 0 in -chain 7
Positions of predicted genes and exons: Variant 1 from 1, Score:105.654358
 G Str  Feature  Start      End   Score    ORF             Len
 1 -    PolA   1224             1.26
 1 -  1 CDSl   1844 -   2024   26.02   1844 -   2023   180
 1 -  2 CDSi   2492 -   2622   20.19   2494 -   2622   129
 1 -  3 CDSi   3243 -   3342   25.15   3243 -   3341    99
 1 -  4 CDSi   3517 -   3668   18.92   3519 -   3668   150
 1 -  5 CDSi   4184 -   4276   11.42   4184 -   4276    93
 1 -  6 CDSi   4569 -   4694   14.86   4569 -   4694   126
 1 -  7 CDSf   5384 -   5566    2.09   5384 -   5566   183
Predicted protein(s):
>FGENESH: 1 7 exon (s) 1844 - 5566 321 aa, chain -
MIHPTKICFTALGSKCADIGTVVHRIRVLFCPLKTDSSGQWPSGWSVRLTYTYCRFDSIT
FETPPTRYTRERHKKALPGTAPHFPNKLSSRVHPRPAKIRATMPLPATHDIHLHGSINGH
EFDMVGGGKGDPNAGSLVTTAKSTKGALKFSPYLMIPHLGYGYYQYLPYPDGPSPFQTSM
LEGSGYAVYRVFDFEDGGKLTTEFKYSYEGSHIKADMKLMGSGFPDDGPVMTSQIVDQDG
CVSKKTYLNNNTIVDSFDWSYNLQNGKRYRARVSSHYIFDKPFSADLMKKQPVFVYRKCH
VKASKTEVTLDEREKAFYELA
Tweak and repeat
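For Genscan, the same pattern should carry over. Here is a rough, untested sketch that swaps in Bio::Tools::Genscan and walks the exons of each predicted gene. The "_genscan_" ID suffix and the 'unknown' fallback are my own choices, and I am assuming the parser fills in seq_id from the report header.

```perl
#!/usr/bin/env perl
# PURPOSE: parse genscan output into gff (sketch adapted from the fgenesh script)
use strict;
use warnings;
use Bio::Tools::Genscan; # to parse output into gene predictions
use Bio::Tools::GFF;     # to write the exons as GFF

# Process the named files, or standard input if none are given.
@ARGV = ('-') unless @ARGV;

my $genscan = Bio::Tools::Genscan->new(-fh => \*ARGV);
my $gffout  = Bio::Tools::GFF->new(-gff_version => 2);
my $idnum   = 0;

while (my $gene = $genscan->next_prediction()) {
    # Assumption: seq_id is set by the parser; fall back to a placeholder.
    my $seq_id = defined $gene->seq_id ? $gene->seq_id : 'unknown';
    my $id     = $seq_id . '_genscan_' . ++$idnum;
    # exons() returns the predicted exons of this gene
    foreach my $exon ($gene->exons) {
        $exon->add_tag_value('Parent', $id);
        $exon->seq_id($seq_id);
        $gffout->write_feature($exon);
    }
}
$genscan->close();
exit 0;
```

Run it the same way as the fgenesh version, with the Genscan report on standard input or as a file argument.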
DAWGPAWS does what you want. Aside from support for gene prediction programs, it also has parsers for transposable element predictions. Most of the annotation files it generates are in GFF format.
If I try an fgenesh result without the FASTA sequence in it, the program uses the first contig as the seq_id for all genes. With the FASTA sequence in the fgenesh output, it throws this error:
------------- EXCEPTION -------------
MSG: Attempting to set the sequence to BLABLA which does not look healthy.
STACK Bio::PrimarySeq::seq /perl/5.8.8/Bio/PrimarySeq.pm:283
STACK Bio::Tools::Fgenesh::next_prediction /perl/5.8.8/Bio/Tools/Fgenesh.pm:247
STACK DAWGPAWS::fgenesh2gff /opt/dawgpaws-1.1/scripts/cnv_fgenesh2gff.pl:286
STACK toplevel /opt/dawgpaws-1.1/scripts/cnv_fgenesh2gff.pl:216
Any ideas?
When I try it with multiple genes in the same contig, it only gives the output for the first gene. Have you run into that?
hmmmm - I don't recall that being an issue at all. I seem to recall something related... namely that fgenesh only processes the first sequence in a multi-fasta file.... but that is not what you are experiencing. Good luck.
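If the BioPerl parser keeps dropping genes or mixing up seq_ids, one fallback is to read the prediction table directly. Below is a minimal core-Perl sketch that assumes the table layout shown earlier in the thread. The helper name `fgenesh_text_to_gff` is hypothetical, the 'TSS' and 'SingletonExon' labels are my own guesses, and the frame column is left as '.' instead of being computed. It uses fgenesh's own gene number in the Parent ID and switches seq_id at every "Seq name:" line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Labels mirror the GFF output shown earlier in the thread;
# 'SingletonExon' and 'TSS' are assumptions made for this sketch.
my %TYPE = (
    CDSf => 'InitialExon',
    CDSi => 'InternalExon',
    CDSl => 'TerminalExon',
    CDSo => 'SingletonExon',
    PolA => 'Poly_A_site',
    TSS  => 'TSS',
);

# Hypothetical helper: turn the text of an fgenesh report into GFF2-style lines.
sub fgenesh_text_to_gff {
    my ($text) = @_;
    my $seq_id = 'unknown';
    my @gff;
    for my $line (split /\n/, $text) {
        # A new "Seq name:" header switches the sequence ID for later rows.
        if ($line =~ /^\s*Seq name:\s*(\S+)/) {
            $seq_id = $1;
            next;
        }
        my ($gene, $strand, $type, $start, $end, $score);
        if ($line =~ /^\s*(\d+)\s+([+-])\s+\d+\s+(CDS[fiol])\s+(\d+)\s+-\s+(\d+)\s+(-?\d+(?:\.\d+)?)/) {
            # Exon row: gene, strand, exon number, type, start - end, score
            ($gene, $strand, $type, $start, $end, $score) = ($1, $2, $3, $4, $5, $6);
        }
        elsif ($line =~ /^\s*(\d+)\s+([+-])\s+(TSS|PolA)\s+(\d+)\s+(-?\d+(?:\.\d+)?)/) {
            # Single-position row: gene, strand, type, position, score
            ($gene, $strand, $type, $start, $score) = ($1, $2, $3, $4, $5);
            $end = $start;
        }
        else {
            next;    # headers, protein sequences, and anything else
        }
        push @gff, join("\t", $seq_id, 'Fgenesh', $TYPE{$type}, $start, $end,
                        $score, $strand, '.', "Parent ${seq_id}_fgenesh_$gene");
    }
    return @gff;
}

# The first three rows come from the report in this thread; gene 2 and
# contigB are made-up rows to exercise the multi-gene and multi-contig cases.
my $sample = <<'END';
Seq name: LanFP_DNA34 Clipped to 31-5694
   1 -      PolA   1224               1.26
   1 -    1 CDSl   1844 -   2024   26.02   1844 -   2023    180
   1 -    7 CDSf   5384 -   5566    2.09   5384 -   5566    183
   2 +    1 CDSo   5600 -   5650   10.00   5600 -   5650     51
Seq name: contigB
   1 +      TSS     100              -3.50
END

print "$_\n" for fgenesh_text_to_gff($sample);
```

Because the Parent ID is built from the current seq_id plus the gene number in the first column, genes on the same contig stay separate and genes on a later contig do not inherit the first contig's name.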
Is it possible to parse the CDS and the protein sequence with this module?
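BioPerl's gene prediction objects do have predicted_protein() and predicted_cds() accessors, but I am not certain the Fgenesh parser fills them in for every report. As a fallback, here is a small core-Perl sketch that pulls the proteins out of the "Predicted protein(s):" section, assuming the header format shown in the report above. The helper name `fgenesh_proteins` is hypothetical. The report in this thread contains protein sequences only, so the CDS would still have to be assembled from the exon coordinates and the genomic sequence, which this sketch does not attempt.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping "<seq_id>_fgenesh_<gene number>"
# to the predicted protein sequence found in an fgenesh report.
sub fgenesh_proteins {
    my ($text) = @_;
    my %protein;
    my $seq_id = 'unknown';
    my ($in_section, $key);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*Seq name:\s*(\S+)/) {
            # New sequence: remember its name and leave any protein section.
            ($seq_id, $in_section, $key) = ($1, 0, undef);
        }
        elsif ($line =~ /^\s*Predicted protein\(s\):/) {
            ($in_section, $key) = (1, undef);
        }
        elsif (!$in_section) {
            next;
        }
        elsif ($line =~ /^>FGENESH:\s*(\d+)\s/) {
            # Header like ">FGENESH: 1 7 exon (s) 1844 - 5566 321 aa, chain -"
            $key = "${seq_id}_fgenesh_$1";
            $protein{$key} = '';
        }
        elsif ($line =~ /^>/) {
            $key = undef;    # some other record type: skip its sequence lines
        }
        elsif (defined $key && $line =~ /^\s*([A-Za-z*]+)\s*$/) {
            $protein{$key} .= $1;    # sequence line: append to current record
        }
    }
    return \%protein;
}

# Taken from the report in this thread, with the protein shortened to its
# first and last lines to keep the example small.
my $report = <<'END';
Seq name: LanFP_DNA34 Clipped to 31-5694
Predicted protein(s):
>FGENESH: 1 7 exon (s) 1844 - 5566 321 aa, chain -
MIHPTKICFTALGSKCADIGTVVHRIRVLFCPLKTDSSGQWPSGWSVRLTYTYCRFDSIT
VKASKTEVTLDEREKAFYELA
END

my $proteins = fgenesh_proteins($report);
for my $id (sort keys %$proteins) {
    print ">$id\n$proteins->{$id}\n";    # FASTA-style output
}
```

The keys match the Parent IDs used for the GFF features, so the proteins can be joined back to their genes afterwards.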