Hi, I'm having trouble reformatting the .gff3 output from InterProScan to .gtf. I used AGAT, but it gave an error about repeated IDs. This online validator gives a similar error message: 'sequence region "tig00000001_377 (...) has already been defined'. Looking back at the original file, there are three instances of this sequence region, but each has a distinct ORF, like so:
##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##interproscan-version 5.50-84.0
##sequence-region tig00000001_377 1 2496
tig00000001_377 provided_by_user nucleic_acid 1 2496 0 + 0 ID=tig00000001_377;Name=tig00000001_377;md5=7d26a317c817503d101bf1feadcb2f93
tig00000001_377_orf11327 getorf ORF 2042 2347 0 - 0 Target=pep_tig00000001_377_2042_2347_r 1 102;ID=orf_tig00000001_377_2042_2347_r;Name=tig00000001_377_orf11327;md5=7d26a317c817503d101bf1feadcb2f93
##sequence-region tig00000001_377 1 2496
tig00000001_377 provided_by_user nucleic_acid 1 2496 0 + 0 ID=tig00000001_377;Name=tig00000001_377;md5=7d26a317c817503d101bf1feadcb2f93
tig00000001_377_orf11359 getorf ORF 3 266 0 - 0 Target=pep_tig00000001_377_3_266_r 1 88;ID=orf_tig00000001_377_3_266_r;Name=tig00000001_377_orf11359;md5=7d26a317c817503d101bf1feadcb2f93
Each also has polypeptide and protein_match features. I've never done this before, so I'm not sure how to proceed. Can the ORFs somehow be combined? Also, the third instance of this sequence region has many protein matches, mostly similar ('Ribonuclease' or 'Rnase').
Additionally, I tried gff3tools to 'fix' my gff3, but since the ORF is included in the ID column, the IDs in the .gff3 did not match the contig names in the original .fna file. Do I have to change the ID column too?
Lastly, this is actually just a small subset of the data. I wanted to run InterProScan on a metagenome, but due to its large size I had to break it up into ~500 smaller files (using the command recommended here). So unfortunately I can't manually check what is happening with repeated sequences in each file.
In short, how can I make this into a valid .gff3 so that it can be converted to .gtf? Any help will be greatly appreciated!
What agat commands are you using? Are you trying this?
I had used agat_convert_sp_gff2gtf.pl. After your comment I tried agat_sp_keep_longest_isoform.pl, but the resulting file only contained the original file headers, plus the header of one sequence region. The script also gave the following warning messages:
and
So I took a look at the Sequence Ontology terms and found that ORF is actually a term in it, but nothing I saw exactly matched 'nucleic acid'. Then I tried updating the JSON files, and for feature 3, the file looks like this:
Would it make sense to add "ORF":"exon" and "nucleic_acid":"exon"?
If your file contains only ORF and nucleic_acid features and they are independent, the way to go is to put them in features_level1.json in this way:
Then it will be processed properly, i.e. duplicated nucleic_acid features will be removed and ORFs will have unique identifiers.
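For anyone who wants to see what that cleanup amounts to before running AGAT on ~500 files, here is a minimal Python sketch of the same idea: keep the first ##sequence-region pragma for each sequence and drop feature lines that are exact repeats. It is not AGAT's implementation. The function name dedupe_gff3 is made up for illustration, and the sample rows are shortened versions of the lines in the question (the md5 and Target attributes are left out). It assumes a duplicate is a byte-for-byte identical line.

```python
def dedupe_gff3(lines):
    """Yield GFF3 lines, skipping repeated ##sequence-region pragmas
    and feature rows that are exact duplicates of an earlier row.

    Hypothetical helper for illustration; not part of AGAT.
    """
    seen_regions = set()   # seqids that already have a ##sequence-region pragma
    seen_features = set()  # full text of feature rows already emitted
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##sequence-region"):
            parts = line.split()
            seqid = parts[1] if len(parts) > 1 else line
            if seqid in seen_regions:
                continue
            seen_regions.add(seqid)
        elif line and not line.startswith("#"):
            if line in seen_features:
                continue
            seen_features.add(line)
        yield line


def row(*fields):
    """Join fields with tabs, as GFF3 feature rows require."""
    return "\t".join(fields)


# Shortened sample based on the rows in the question (attributes trimmed).
contig = row("tig00000001_377", "provided_by_user", "nucleic_acid",
             "1", "2496", "0", "+", "0",
             "ID=tig00000001_377;Name=tig00000001_377")
sample = "\n".join([
    "##gff-version 3",
    "##sequence-region tig00000001_377 1 2496",
    contig,
    row("tig00000001_377_orf11327", "getorf", "ORF", "2042", "2347",
        "0", "-", "0", "ID=orf_tig00000001_377_2042_2347_r"),
    "##sequence-region tig00000001_377 1 2496",
    contig,
    row("tig00000001_377_orf11359", "getorf", "ORF", "3", "266",
        "0", "-", "0", "ID=orf_tig00000001_377_3_266_r"),
])

result = list(dedupe_gff3(sample.splitlines()))
print("\n".join(result))
```

Both ORF rows are kept because their lines differ; only the repeated pragma and the repeated nucleic_acid row are dropped. This does not touch the first column, so the ORF rows still carry the _orfNNNN suffix rather than the contig name.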