4.9 years ago
Juke34
AGAT - Another Gff/Gtf Analysis Toolkit
Suite of tools to handle gene annotations in any GTF/GFF format. Available through conda and Docker for an easy install/usage.
Why AGAT?
- The first idea was to be able to parse every GTF/GFF version, along with all the underlying flavors that can be encountered (I listed more than 30 cases). To my knowledge AGAT is the only tool able to handle all of them. How? By parsing in three ways concomitantly, with different priorities:
- i) using the parent/child relationship
- ii) using a common tag to group features together (an attribute from the 9th column sharing the same "locus" value)
- iii) using a sequential approach (e.g. all exons are attached to the last gene met if neither of the first two approaches has worked)
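To make the priority order concrete, here is a minimal Python sketch of the three strategies. It is not AGAT's actual code (AGAT is written in Perl): the function names, the `locus_tag` attribute name, and the returned labels are assumptions chosen for illustration only.

```python
def parse_attributes(column9):
    """Parse a GFF3-style 9th column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def assign_parent(attrs, known_ids, locus_to_parent, last_parent, tag="locus_tag"):
    """Pick a parent for one feature, trying the three strategies in order.

    known_ids:       IDs of features already read (hypothetical bookkeeping)
    locus_to_parent: maps a common-tag value to the feature that owns it
    last_parent:     ID of the last gene-level feature met, or None
    tag:             name of the common tag (assumed here to be 'locus_tag')
    """
    # i) explicit parent/child relationship
    parent = attrs.get("Parent")
    if parent in known_ids:
        return parent, "parent"
    # ii) common tag shared between related features
    locus = attrs.get(tag)
    if locus in locus_to_parent:
        return locus_to_parent[locus], "common_tag"
    # iii) sequential fallback: attach to the last gene met
    if last_parent is not None:
        return last_parent, "sequential"
    return None, "unresolved"
```

For example, a feature whose 9th column is `ID=exon2;locus_tag=LOC7` and has no usable `Parent` would be attached through the common tag, while a feature with neither would fall back to the last gene met.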
- The second idea was to be able to create a fully standardised GFF3 file that could actually fit any tool. AGAT excels compared to many tools at creating the missing information:
- missing features (gene, mRNA, tRNA, exon, UTRs, etc...)
- missing attributes (ID, Parent)
and fixing wrong information:
- make identifiers unique.
- feature location (e.g. an mRNA will be stretched if it is shorter than its exons).
- remove duplicated features.
- group related features (if spread in different places in the file).
- sort features.
- merge overlapping loci (only if the option is activated, because this is not something we would want for prokaryotes)
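Two of the fixes listed above (unique identifiers and stretched feature locations) can be illustrated with a short Python sketch. This is only a sketch of the idea under simple assumptions, not AGAT's implementation: coordinates are plain `(start, end)` tuples, and the `_2`, `_3` suffix scheme is invented for the example.

```python
def stretch_parent(parent, children):
    """Widen a parent (start, end) so that it covers all of its children.

    The parent is only ever stretched, never shrunk.
    """
    start = min([parent[0]] + [child[0] for child in children])
    end = max([parent[1]] + [child[1] for child in children])
    return (start, end)


def make_ids_unique(ids):
    """Return the IDs in the same order, renaming any repeated ID.

    Repeats get a numeric suffix (hypothetical scheme: 'x', 'x_2', 'x_3').
    """
    used = set()
    result = []
    for identifier in ids:
        candidate = identifier
        suffix = 1
        # Keep bumping the suffix until the candidate is not taken yet.
        while candidate in used:
            suffix += 1
            candidate = f"{identifier}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result
```

With these helpers, an mRNA annotated at `(100, 200)` whose exons span `(90, 150)` and `(160, 250)` would be stretched to `(90, 250)`.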
- The third idea was to have a correct topological sorting output. To my knowledge AGAT is the only one dealing properly with this task. More information about it here.
- Finally, based on the abilities described above, I have developed a toolkit to perform different tasks. Some are original, some are similar to what other tools might offer, but within AGAT they have the strength of the first three points.
A few examples among the >50 tools available:
- check, fix and pad missing information into a sorted and standardised file:
agat_convert_sp_gxf2gxf.pl
- make statistics:
agat_sp_statistics.pl
- extract any type of sequence:
agat_sp_extract_sequences.pl
- complement annotations (non-overlapping loci):
agat_sp_complement_annotations.pl
- merge annotations:
agat_sp_merge_annotations.pl
- filter gene models by ORF size:
agat_sp_filter_by_ORF_size.pl
- filter to keep only longest isoforms:
agat_sp_keep_longest_isoform.pl
- create intron features:
agat_sp_add_introns.pl
- fix CDS phases:
agat_sp_fix_cds_phases.pl
- extract attributes:
agat_sp_extract_attributes.pl
- manage IDs:
agat_sp_manage_IDs.pl
- convert into tabulated format:
agat_sp_to_tabulated.pl
- compute sensitivity and specificity:
agat_sp_sensitivity_specificity.pl
- fusion / split analysis between two annotations:
agat_sp_compare_two_annotations.pl
- analyze differences between BUSCO results:
agat_sp_compare_two_BUSCOs.pl
[...]
You should follow the more modern development model where tools work via subcommands rather than polluting the namespace with hard to discover script names.
instead of:
it should be:
running:
should produce a list of commands with short descriptions for each.
The approach was first introduced to bioinformatics by bwa, then adopted by bedtools and other frameworks.
Thank you for the feedback, you are right! Colleagues told me the same, but I have been lazy and haven't implemented it (yet). It is something I should do for version 1.0.0... if I find time to work on it :)
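For readers unfamiliar with the subcommand model suggested above, here is a minimal Python sketch using the standard `argparse` module. The program name `toolkit` and the subcommands `stats` and `convert` are hypothetical placeholders, not real AGAT commands.

```python
import argparse


def build_parser():
    """Build a single entry point that dispatches to named subcommands."""
    parser = argparse.ArgumentParser(
        prog="toolkit",  # hypothetical program name
        description="One entry point, many subcommands.",
    )
    subparsers = parser.add_subparsers(dest="command", metavar="<command>")

    # Each subcommand carries a short description shown in the top-level help.
    stats = subparsers.add_parser("stats", help="compute annotation statistics")
    stats.add_argument("gff", help="input annotation file")

    convert = subparsers.add_parser("convert", help="standardise an annotation file")
    convert.add_argument("gff", help="input annotation file")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command is None:
        # No subcommand given: list the available commands and exit.
        parser.print_help()
        return 0
    print(f"would run '{args.command}' on {args.gff}")
    return 0
```

Calling `main([])` prints the list of subcommands with their one-line descriptions, which is the discoverability benefit the comment describes.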
Hi Juke34
I am using AGAT for a GFF3 to GTF conversion and I saw that some of the isoforms were removed ("28 identical isoforms removed"). Why should an isoform be removed? I don't understand, especially if the GTF has to be used for RNA-Seq. Will it make any difference to the count quantification for both genes and transcripts (if the TranscriptomeSAM option is also used to get transcript counts in STAR)?
If they are identical in all points to another isoform, I don't see the reason to keep them. It is most likely an error made while creating the file/annotation. It should not make any difference for gene quantification. For transcripts, well, the removed isoforms will not have any counts because they will be absent from your file, so if you are specifically tracking one of those isoforms that could become problematic. But as they are identical they should hold the same information; just the ID will differ. So in this specific case you just have to track the other identical isoform.
Wonderful. Indeed they had identical exon start and end and transcript st too. Thanks Jacques. You really saved the day. The AGAT was otherwise smooth and took care of the problems we were previously facing with other converters. Thanks a load again.
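As an illustration of the identical-isoform removal discussed in this exchange, here is a minimal Python sketch. The criteria AGAT actually applies are not stated in the thread, so this assumes that two isoforms are identical when they belong to the same gene and share the same transcript and exon coordinates. The dictionary layout and the function name are hypothetical.

```python
def remove_identical_isoforms(transcripts):
    """Drop isoforms that duplicate an earlier one of the same gene.

    Each transcript is a dict with keys 'id', 'gene', 'start', 'end' and
    'exons' (a list of (start, end) tuples). Returns the kept transcripts
    and a mapping {removed_id: id_of_the_identical_isoform_that_was_kept}.
    """
    first_seen = {}
    kept = []
    removed = {}
    for transcript in transcripts:
        # Everything except the ID goes into the comparison key.
        key = (
            transcript["gene"],
            transcript["start"],
            transcript["end"],
            tuple(sorted(transcript["exons"])),
        )
        if key in first_seen:
            removed[transcript["id"]] = first_seen[key]
        else:
            first_seen[key] = transcript["id"]
            kept.append(transcript)
    return kept, removed
```

The returned mapping tells you which surviving isoform to track in place of each removed one, which is the workaround suggested above for transcript-level counts.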