I am trying to run the following gffcompare command:
gffcompare -r ref.gff -G -o merged stringtie-merged.gtf
ref.gff - downloaded from NCBI
stringtie-merged.gtf - obtained from the stringtie --merge command
Error encountered: GFF Error: overlapping duplicate transcript feature (ID=gene29892)
When I grep "gene29892" in both ref.gff and stringtie-merged.gtf, I get the following.
From ref.gff:
NC_007957.1 RefSeq gene 74631 74744 . + . ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=1/2
NC_007957.1 RefSeq gene 146276 147073 . + . ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=2/2
NC_007957.1 RefSeq CDS 74631 74744 . + 0 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq CDS 146276 146507 . + 0 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq CDS 147048 147073 . + 2 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq exon 146276 146507 . + . ID=id318095;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=1;gbkey=exon;gene=rps12
NC_007957.1 RefSeq exon 147048 147073 . + . ID=id318096;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=2;gbkey=exon;gene=rps12
From stringtie-merged.gtf
NC_007957.1 StringTie transcript 74631 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 74631 74744 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 146276 146507 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "2"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 147048 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "3"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie transcript 146276 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 146276 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892";
What could possibly be wrong? I cannot see any duplicate values. Some "start" and "stop" coordinates are identical, but the tags/labels are different.
But then they are indeed two different genes on the same scaffold. By "duplicates" I assumed that the "ID", start, and stop would all have to be the same. Anyway, thanks for the answer. Any suggestions on how to bypass this issue? The GFF3 file was obtained from NCBI.
Additionally, according to the linked ReadMe.md file, it looks like either GFF or GTF can be used without any issue.
For a program that reads a GFF/GFF3/GTF file, duplicate means "having the same ID", because the ID is the key used in the dictionary. Also, the definition of the ninth GFF field states that the "ID" must always be unique, while the "Name" need not be.
This is most likely a mistake by whoever made that GFF file. Not everything you find on NCBI is glittering gold, so it's always better to be careful!
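To see which IDs a parser would treat as duplicates, here is a minimal Python sketch that scans GFF3 lines and reports every ID that occurs on more than one line of a given feature type. The function name find_repeated_ids and the feature_type parameter are made up for illustration; this is not how gffcompare itself performs the check, and it assumes standard tab-separated, nine-column GFF3.

```python
from collections import defaultdict


def find_repeated_ids(gff_lines, feature_type="gene"):
    """Return {ID: [(seqid, start, end), ...]} for every ID that appears
    on more than one line whose third column equals feature_type."""
    seen = defaultdict(list)
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Ignore malformed lines and features of other types.
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        feature_id = attrs.get("ID")
        if feature_id is not None:
            seen[feature_id].append((cols[0], int(cols[3]), int(cols[4])))
    # Keep only IDs that occur on more than one line.
    return {fid: locs for fid, locs in seen.items() if len(locs) > 1}
```

Passing an open file handle for ref.gff as gff_lines should list gene29892 with both of its coordinate ranges (74631-74744 and 146276-147073), which are the two "gene" lines sharing one ID in the grep output above.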