I'm trying to map samples against the Xenopus laevis genome from Xenbase.org (latest version, 6.0). They provide a GFF3 file and a FASTA file, which look like the following:
FASTA
> 27051543
ATGGCGGATGTGAAGGTCTCGTTCCAGTGCCCAGGCCGGATGTACAGCCCCGCGTGGGTGGCACCTGAGGCGCTGCAGAA
ACGCCCAGAGGATATTAACCGTCGCTCTGCTGACATGTGGAGTTTTGCCGTTCTGCTTTGGGAGCTGGTGACCCGCGAGG
TTCCATTTGCCGACCTCTCAAACATGGAGATTGGCATGAAGGTTTCCCTTGAAGGCCTCCGTCCCACCATCCCCCCCGGG
ATCTCGCCCCATATCTGCAAGTTGATGAAGATTTGTATGAACGAAGACCCTGCCAAGCGACCCAAGTTTGATATGATCGC
CCCCATCCTGGAGAAGATGCAGGAGAAATAA
> 27051545
TTTGGACTGTGCGTGAATTTAAAGAAAGCAGACAAATTCTTCCCGCGTTGCTATAACCTGGCGGATAAAACAGGGAGAAT
GTTATTCACTGATGACTTCATGAAAACTGCAGCGTATAGTATCATAAAATGGGTTGTAACAAGAAACAGTACGCCTATTA
AAGCAGAAGCCAATGTAATTTTAATGGCTTTTATGGTCTGCAAAATGTTCATGATTCCCTCAGTAAATAAGGACATAGAC
GFF3
##gff-version 3
Scaffold100041 JGI_gene gene 2092 20066 . + . ID=XeXenL6RMv10000001m.g;Name=XeXenL6RMv10000001m.g
Scaffold100041 JGI_gene mRNA 2092 20066 . + . ID=PAC:27060736;Name=XeXenL6RMv10000001m;pacid=27060736;longest=1;Parent=XeXenL6RMv10000001m.g
Scaffold100041 JGI_gene five_prime_UTR 2092 2223 . + . ID=PAC:27060736.five_prime_UTR.1;Parent=PAC:27060736;pacid=27060736
Scaffold100041 JGI_gene five_prime_UTR 2490 2505 . + . ID=PAC:27060736.five_prime_UTR.2;Parent=PAC:27060736;pacid=27060736
Scaffold100041 JGI_gene CDS 2506 2585 . + 0 ID=PAC:27060736.CDS.1;Parent=PAC:27060736;pacid=27060736
Scaffold100041 JGI_gene CDS 4114 4216 . + 1 ID=PAC:27060736.CDS.2;Parent=PAC:27060736;pacid=27060736
Scaffold100041 JGI_gene CDS 4370 4449 . + 0 ID=PAC:27060736.CDS.3;Parent=PAC:27060736;pacid=27060736
Scaffold100041 JGI_gene CDS 6233 6422 . + 1 ID=PAC:27060736.CDS.4;Parent=PAC:27060736;pacid=27060736
Scaffold100041 JGI_gene CDS 7542 7700 . + 0 ID=PAC:27060736.CDS.5;Parent=PAC:27060736;pacid=27060736
So the GFF3 uses PAC:XXXXXXX as the ID, but the FASTA doesn't. During TopHat2 mapping, the following step runs:
/bin/map2gtf --sam-header ./tophat_out/tmp/Scaffold10.nucleotide_genome.bwt.samheader.sam /tmp/Simbiot_HSS/index/Scaffold10.nucleotide.gff - ./tophat_out/tmp/left_kept_reads.m2g.bam
The error is:
[samopen] SAM header is present: 43025 sequences.
GList error (GList.hh:981):Invalid list index: 27078510
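To make the naming mismatch described above concrete, here is a minimal sketch that collects the record names from the FASTA headers and compares them with the sequence names (column 1) and the pacid attributes in the GFF3. The function names fasta_ids and gff3_names are hypothetical, and the inline sample data is copied from the excerpts above; whether this mismatch is what actually triggers the GList error is only an assumption.

```python
def fasta_ids(lines):
    """Return the set of record names taken from FASTA header ('>') lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()  # tolerates '> 27051543' with a space
            if header:
                ids.add(header.split()[0])
    return ids


def gff3_names(lines):
    """Return (seqids, pacids) found in GFF3 feature lines."""
    seqids, pacids = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and directives such as ##gff-version
        # GFF3 is tab-separated; fall back to whitespace for pasted excerpts.
        cols = line.split("\t") if "\t" in line else line.split()
        if len(cols) < 9:
            continue
        seqids.add(cols[0])
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key == "pacid":
                pacids.add(value)
    return seqids, pacids


# Sample data copied from the excerpts in this post (sequences shortened).
sample_fasta = ["> 27051543", "ATGGCGGATGTGAAGG", "> 27051545", "TTTGGACTGTGCGTGA"]
sample_gff = [
    "##gff-version 3",
    "Scaffold100041 JGI_gene gene 2092 20066 . + . "
    "ID=XeXenL6RMv10000001m.g;Name=XeXenL6RMv10000001m.g",
    "Scaffold100041 JGI_gene mRNA 2092 20066 . + . "
    "ID=PAC:27060736;Name=XeXenL6RMv10000001m;pacid=27060736;longest=1;"
    "Parent=XeXenL6RMv10000001m.g",
]

names = fasta_ids(sample_fasta)
seqids, pacids = gff3_names(sample_gff)
print("FASTA names also used as GFF3 sequence names:", sorted(names & seqids))
print("FASTA names also used as GFF3 pacid values:", sorted(names & pacids))
```

On the real files, the same functions can be given open file handles instead of the sample lists. If the first intersection is empty for the whole files, the annotation and the FASTA are not describing the same set of sequences.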
When I tried to convert the GFF3 to GTF, I got the following error:
Can't locate object method "display_text" via package "Bio::Annotation::SimpleValue" at /usr/local/share/perl5/Bio/SeqFeature/Annotated.pm line 703, <GEN0> line 2.
The conversion code looks like this:
#!/usr/bin/perl
use strict;
use warnings;
use lib '/local/ensembl/bioperl-live';
use Bio::FeatureIO;

# Read features from the GFF3 file.
my $in = Bio::FeatureIO->new(-file => "/tmp/Simbiot_HSS/index/Scaffold10.nucleotide.gff3",
                             -format => 'GFF');
# Write the same features out as GTF.
my $out = Bio::FeatureIO->new(-file => ">/tmp/Simbiot_HSS/index/test.gtf",
                              -format => 'GTF');
while ( my $feature = $in->next_feature() ) {
    $out->write_feature($feature);
}
exit(0);
Am I missing something?
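For comparison with the BioPerl script, here is a rough, dependency-free sketch of the same conversion idea. It rewrites the attribute column of each child feature into the gene_id / transcript_id form that GTF requires. The function name gff3_to_gtf is hypothetical. The sketch assumes tab-separated input, that every mRNA line appears before its children, and that each child has a single Parent. It keeps the feature types (CDS, five_prime_UTR) unchanged, so it is not a complete converter.

```python
def gff3_to_gtf(lines):
    """Yield GTF-style lines for features whose Parent is an mRNA seen earlier."""
    gene_of = {}  # transcript ID -> gene ID, filled from mRNA lines
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if cols[2] == "mRNA":
            # Remember which gene this transcript belongs to.
            gene_of[attrs["ID"]] = attrs.get("Parent", attrs["ID"])
        elif attrs.get("Parent") in gene_of:
            tid = attrs["Parent"]
            cols[8] = 'gene_id "%s"; transcript_id "%s";' % (gene_of[tid], tid)
            yield "\t".join(cols)


# Two feature lines from the GFF3 excerpt above, rebuilt with tabs.
example = [
    "##gff-version 3",
    "\t".join(["Scaffold100041", "JGI_gene", "mRNA", "2092", "20066", ".", "+", ".",
               "ID=PAC:27060736;Name=XeXenL6RMv10000001m;pacid=27060736;"
               "longest=1;Parent=XeXenL6RMv10000001m.g"]),
    "\t".join(["Scaffold100041", "JGI_gene", "CDS", "2506", "2585", ".", "+", "0",
               "ID=PAC:27060736.CDS.1;Parent=PAC:27060736;pacid=27060736"]),
]
converted = list(gff3_to_gtf(example))
for row in converted:
    print(row)
```

The gene and mRNA lines themselves are not emitted, because only lines with a known transcript parent are written out.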
@Istvan Thank you very much. TopHat picks up the GFF file itself. I only provide the prefix path, like "/tmp/Simbiot_HSS/index/Scaffold10.nucleotide", and it automatically picks up the GFF (maybe version 2 or 3). I also tried to convert that GFF3 to GTF but got an error; see above.