I currently have two GTFs which are incomplete. What I mean by this?. For example, in one of them I have 5000 genes and in another I have 7000, however, there are 3000 genes that coincide (they are the same). Therefore, I would like to create a GTF with the 5000 + 7000 genes, but without duplicating the 3000 identical genes, so I would have in the end a GTF with 9000 genes (not 12000).
What could I do? I think both Cuffmerge and Cuffcompare are useless in these cases.
agat_sp_compare_annotations.pl will tell you the differences.
Then you can choose to complement the annotation using agat_sp_complement_annotations.pl, that means you use one annotation as reference and you will pick from the second annotation only non-overlaping predictions and add them to the reference annotation (i.e. for overlapping loci only the gene prediction from the reference is kept)
Or to merge annotation using agat_sp_merge_annotations.pl, that mean you take non overlaping locus from both annotations and you merge the annotation when they are overlapping (overlapping genes will be merged as one locus and different mRNA will be seen as isoforms, duplicates mRNA will be removed to keep one)
I'm trying to put the two GTFs together, but I've noticed that GTF 1 only has exons, so it doesn't count as having any genes, CDs... etc because it was created by a RNAseq of the hole genome.
What can I do to annotate it correctly? Because like this with functions agat_sp_complement_annotations.pl and agat_sp_merge_annotations.pl can not count genes.
Could you show few lines of the original file? You will probably need to create proper gene feature first using an appropriate attribute (using agat_convert_sp_gxf2gxf.pl with --locus_tag).
I think gffread does not deal with all feature types, e.g. 3utr, 5utr, start_codon, stop_codon, (ncRNA?) etc. will be removed. If you just need gene,transcipt,exon,cds it should be fine.
You can merge any GTF/GFF annotation together with AGAT. The first step of most of the AGAT tools is to fix the file to standardize it. (e.g. a file containing only exon will be modified to create mRNA and gene features).
If you need to use the --locus_tag option you will have to process the file agat_convert_sp_gxf2gxf.pl first before running any other tool.
The only problem that might occur is when you have a file containing only level3 feature (level1=gene, level2=mRNA,level3=CDS,exon,UTR,etc ) and no relationship described. That might be the case you described previously. Lets look into details:
Case1 only feature level3 but ID/Parent(GFF) or gene_id/transcript_id(GTF) relationship:
=> This is WRONG. All features level 3 will be attached to a uniq gene.
So, I hope you didn't end up in this case3 which is very rare (It should not exists because this type of input is actually not a GFF or GTF but I have already seen it).
locus_tag and gene_id are already used by default by AGAT. No need to do -c locus_tag, use '-c' only if you want to use another attribute as comon tag.
The output you show from agat_sp_merge_annotations.pl is unexpected. Could you send me over the two files you used as input? Or only the part containing those records?
Ok I see the problem. Merge locus works if CDS of a gene overlaps the CDS of another gene, or if an exon of a gene (that do not hane any CDS) overlaps the exon of another gene (that do not have CDS neither). But In your case you have one gene with CDS and the other gene wihtout CDS but only exon.
I will have to update that, it should be merged but only if the level2 is the same for both loci (here transcript).
I will open an issue in the repo.
I saw another thing that explain why AGATdidn't succeed to deal with your cufflinks file properly.
It detects two possible version version of the GFF3 and GFF2, and decide to use GFF3 parser while it should use GFF2.
It can be fixed using --gvi 2
I have tried it, and now it only counts as gene one of the two if it is duplicated in coordinates (now it counts the real genes that we have), however, it keeps leaving it as transcripts the 2, and my goal is that if these two transcripts are duplicated and they agree on the coordinates, that in the final file there were not 2 transcribed with the same coordinates. Any suggestion?
The thing here is that RNA and transcript is seen as two different level2 features, so it is not merged together. RNA is created by AGAT. I should probably modify it to call it transcript (but we would have mRNA we would have same problem that avoid feature to be merged, they must be called the same way.).
So you should first parse the cufflinks output MergedWalsh.gtf with agat_convert_sp_gxf2gxf.pl:
then replace RNA from the third column of the output by transcript (awk '{if($3=="RNA"){$3="transcript"} print $0}' MergedWalsh_clean.gff > MergedWalsh_clean_transcript.gff) and rerun the merge script using MergedWalsh_output_transcript.gff instead of MergedWalsh.gtf.
And after transforming the GTF reference using agat_convert_sp_gxf2gxf.pl --g Mustela_putorius_furo.MusPutFur1.0.100_with_607genes.gtf -o genome_with_607genes_perl.gff
I merged both using agat_sp_merge_annotations.pl --gff MergedWalsh_perl.gff --gff genome_with_607genes_perl.gff --out Walsh_with_607_genes_final2, giving me the result that I mentioned in the answer before.
Look at the final result, it does not have 'RNA' in level2 features. Now the "gene" level2 feature is not duplicated, but the transcript it is still the same.
It would be very useful (not only to detect identical isoforms but also to delete them), because in this case probably the second one also had a CDS, but as far as we started from a GTF with all exons MergedWalsh.gtf:
And gffread only annotated the gene and transcript, we do not have the CDS, but probably it is the same transcript because the length of the transcript and the exon is the same.
For me it is okay because I wanted to increase the coverage of a RNA-seq, and have at the end more genes (the isoforms do not matter at all) but I wanted to know how was the best way to create the GFF.
Thank you and we will keep in contact if you add this extra tool to detect identical isoforms :)
And with the reference of NCBI with agat_sp_statistics.pl --gff GCF_000215625.1_MusPutFur1.0_genomic.gtf -o NCBI_mitochondrial_perl_stat
--------------------------------------------------------------------------------
Compute mrna with isoforms if any
Number of genes 19681
Number of mrnas 48095
Number of mrnas with utr both sides 44009
Number of mrnas with at least one utr 46661
Number of cdss 48095
Gene_id is not unique within the file. So you merge genes together. You must parse your file without looking at any relationship but grouping feature I the order they appear (sequentially). To do that you must remove the use of a common tag value. Add “--ct fake” to your command.
Compare annotation tool
agat_sp_compare_annotations.pl
from AGAT toolkit. Install AGAT usingconda
.Thank you!! I will try it :)