I am getting the following error using cuffmerge (2.2.1):
[Mon Apr 18 07:07:41 2016] Beginning transcriptome assembly merge
-------------------------------------------
[Mon Apr 18 07:07:41 2016] Preparing output location cuffmerge/
[Mon Apr 18 07:07:57 2016] Converting GTF files to SAM
[07:07:57] Loading reference annotation.
GFF Error: duplicate/invalid 'transcript' feature ID=id102945
[FAILED]
Error: could not execute gtf_to_sam
The reference GFF came from NCBI. Here's what I get if I grep for "ID=id102945":
NW_015493306.1 Gnomon C_gene_segment 47653 72466 . - . ID=id102945;Parent=gene4402;Dbxref=GeneID:107506276;gbkey=C_region;gene=LOC107506276;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns
Any suggestions? I can't seem to find anything. I didn't have any issues when running tophat or cufflinks using the same GFF.
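Since grep only shows one ID at a time, it can help to list every ID that appears on more than one line of the GFF. Below is a minimal Python sketch of that check. The function names are my own, not part of cufflinks or gffread, and it assumes a tab-separated GFF3 with key=value attributes in column 9. Note that GFF3 allows a repeated ID for multi-line features such as CDS, so a hit is only a candidate to inspect, not proof of what the cuffmerge parser objects to.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_duplicate_ids(lines):
    """Return {ID: [(line_number, feature_type), ...]} for IDs seen more than once.

    Assumes tab-separated GFF3 lines; comment and malformed lines are skipped.
    """
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is not None:
            seen[feature_id].append((number, cols[2]))
    # Keep only the IDs that occur on two or more lines.
    return {fid: hits for fid, hits in seen.items() if len(hits) > 1}
```

To run it on a real file, pass an open file handle, e.g. find_duplicate_ids(open("reference.gff")), where reference.gff is a placeholder name.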
For what it is worth, I am unable to validate the file using the GFF validator at genometools.org (with the Seq Ontology option selected):
Validation unsuccessful!
GenomeTools error: the child feature with type 'V_gene_segment' on line 17186 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/_9407.gff3.gz" is not part-of parent feature with type 'gene' given on line 17185 (according to type checker 'OBO file /home/satta/genometools_for_web/gtdata/obo_files/so.obo')
EDIT: Still no luck; the "filtered" GTF causes tophat to report errors. I found a different GTF available for the same genome (NCBI has several). However, I am still getting an error about the same entry:
[Fri Apr 22 07:49:38 2016] Converting GTF files to SAM
[07:49:38] Loading reference annotation.
GFF Error: duplicate/invalid 'transcript' feature ID=id102945
[FAILED]
Error: could not execute gtf_to_sam
The entry in question (new GFF):
NW_015493310.1 Gnomon exon 2165474 2165578 . + . ID=id102945;Parent=rna8992;Dbxref=GeneID:107506309,Genbank:XM_016136996.1;gbkey=mRNA;gene=REV3L;product=REV3 like%2C DNA directed polymerase zeta catalytic subunit%2C transcript variant X3;transcript_id=XM_016136996.1
What I am having trouble understanding is that this error happens if I supply cuffmerge a reference sequence, reference GFF, both or nothing. So I'm assuming it is a problem with the output of cufflinks.
I just tried running it again without supplying a reference GFF or genomic FASTA and got the same error, so it looks like the problem lies with the GTF files generated by cufflinks. However, when I grep the transcripts.gtf files from my samples, nothing turns up.
I'll give filtering a try, really hoping that I don't have to start over...
EDIT: I've tried filtering a few times, but without any success. After doing some more hunting, I noticed that cufflinks has a utility called gffread, which is able to parse/filter/validate GFF files. I've generated a "fixed" GFF file that passes validation, but cuffmerge is still failing. I'll start over from scratch with this new GFF; hopefully that will work.
Hi Joe, did you manage to make it work eventually? I came to the same conclusion as you: the NCBI GTF file isn't the problem, but the cufflinks GTF files are. I removed the "duplicate/invalid" feature ID from them, but the same error message popped up with another problematic feature.
gffread transcripts.gtf
gives the same error message: GFF Error: duplicate/invalid 'transcript' feature ID=rna67847
Yes, let me dig up what I ended up doing. You can't simply remove the problematic features since they might be the parent of something else. You sort of have to rebuild it and filter out those features as you go.
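To illustrate the "rebuild and filter as you go" idea, here is a rough Python sketch. It is not the exact procedure I used, and the function names are made up for this example. It assumes a tab-separated GFF3 where children point at their parent through the Parent attribute, and it drops a child as soon as any one of its parents is dropped, which may be too aggressive for files where features share parents.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def drop_with_descendants(lines, bad_ids):
    """Return the lines that survive after removing bad_ids and all descendants.

    Assumption: a feature is removed if ANY of its Parent IDs is removed.
    Comment lines and lines with fewer than 9 columns are kept untouched.
    """
    records = []  # (original line, feature ID or None, list of parent IDs)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            records.append((line, None, []))
            continue
        attrs = parse_attributes(cols[8])
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        records.append((line, attrs.get("ID"), parents))

    # Grow the removed set until no new descendants are found.
    removed = set(bad_ids)
    changed = True
    while changed:
        changed = False
        for _, fid, parents in records:
            if fid is None or fid in removed:
                continue
            if any(parent in removed for parent in parents):
                removed.add(fid)
                changed = True

    # Second pass: keep only lines untouched by the removed set.
    return [
        line
        for line, fid, parents in records
        if fid not in removed and not any(p in removed for p in parents)
    ]
```

One caveat: removal is by ID, so if the offending ID is shared by unrelated features (as in the grep output above), every feature carrying that ID goes, together with its children.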
I think it is a mixture of both the NCBI file and the cufflinks GFF/GTF parsing. What is really problematic is that the documentation claims that all cuff* tools use gffread to parse GFF/GTF files, yet they don't all throw the same errors (or any errors) when parsing.