I finished my GTF file, and now I am trying to use it in an RNA-seq analysis to annotate the reads. The problem is that I merged 2 GTFs obtained through different methods (Ensembl's reference annotation, and a GTF created with Cufflinks from another RNA-seq experiment). After merging, although there is only one gene, the same exons (and the same transcripts) appear twice, giving 2 isoforms with the same coordinates.
In this case, we have 2 annotations created with different methods, so if the exons have the same coordinates, they are probably the same exon.
I want to use the GTF to run RNA-seq quantification and annotate the reads. If two exons overlap, HTSeq will mark the read as ambiguous and won't count it as a valid read (look at all the ambiguous counts).
__no_feature 454236
__ambiguous 2600391
__too_low_aQual 0
__not_aligned 335530
__alignment_not_unique 224705
3614862 not aligned
49608 aligned
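To put these counters in proportion, here is a minimal Python sketch that turns them into percentages. It assumes that the last number (49608) is the number of reads that were actually assigned to a gene, and that the five special counters add up to the rest; the variable names are only illustrative.

```python
# Special counters copied from the htseq-count summary above.
special_counters = {
    "__no_feature": 454236,
    "__ambiguous": 2600391,
    "__too_low_aQual": 0,
    "__not_aligned": 335530,
    "__alignment_not_unique": 224705,
}
# Assumption: this is the number of reads assigned to a gene.
assigned = 49608

unassigned = sum(special_counters.values())
total = unassigned + assigned


def percent(count, total_reads):
    """Return count as a percentage of total_reads."""
    return 100.0 * count / total_reads


for name, count in special_counters.items():
    print(f"{name:<24}{count:>10}  {percent(count, total):5.1f}%")
print(f"{'assigned':<24}{assigned:>10}  {percent(assigned, total):5.1f}%")
```

Under that assumption, the ambiguous reads are by far the largest category, which is why the overlapping features matter so much here.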
Is there a tool that I can use to remove all overlapping transcripts?
Thank you!
Please don't post screenshots of data. People can't easily read them, nor copy the information in them for diagnostics or tests.
Sorry, I wanted to highlight these 4 numbers.
Tagging: Juke34
AGAT cannot help here. As explained in the other thread, it removes identical transcripts only if all level3 features (exon, CDS, UTR, etc.) are identical between the transcripts. What might help is to factorize the annotation. I mean instead of
having
This is allowed in the GFF format (not in GTF). That way, each feature is present only once in the file, and the Parent attribute shows that it is used by several transcripts... but not all tools can deal with that. I don't know of any tool that does this. I wanted to implement it in AGAT but never had the time.
A partial solution that might clean up part of your problem would be to predict the CDS within the Cufflinks exons (e.g. using TransDecoder), and run agat_convert_sp_gxf2gxf.pl with the merge_loci option; then, if the CDS and exons are the same between the transcripts, the duplicates will be removed.
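To illustrate what factorizing could look like, here is a minimal Python sketch (not an AGAT feature, and not a tested tool). It collapses exon lines that share the same sequence, start, end and strand into a single GFF3 line whose Parent attribute lists all the transcripts, separated by commas. The function names and the example records (gene1, tx1, tx2) are hypothetical, and the sketch deliberately drops the exon IDs and keeps the source column of the first copy it sees.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attrs[key] = value
    return attrs


def factorize_exons(gff3_lines):
    """Collapse exons with identical coordinates into one line with several parents."""
    records = []  # output order: plain lines, or keys pointing into `shared`
    shared = {}   # (seqid, start, end, strand) -> (first 8 columns, parent list)
    for raw in gff3_lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "exon":
            records.append(line)  # non-exon lines are passed through unchanged
            continue
        key = (cols[0], cols[3], cols[4], cols[6])
        parents = parse_attributes(cols[8]).get("Parent", "").split(",")
        if key not in shared:
            shared[key] = (cols[:8], [])
            records.append(key)
        for parent in parents:
            if parent and parent not in shared[key][1]:
                shared[key][1].append(parent)
    out = []
    for rec in records:
        if isinstance(rec, tuple):
            cols, parents = shared[rec]
            # Exon IDs are dropped here; only the merged Parent list is kept.
            out.append("\t".join(cols + ["Parent=" + ",".join(parents)]))
        else:
            out.append(rec)
    return out


# Hypothetical example: two transcripts of one gene sharing their first exon.
example = [
    "chr1\tensembl\tmRNA\t100\t500\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tCufflinks\tmRNA\t100\t500\t.\t+\t.\tID=tx2;Parent=gene1",
    "chr1\tensembl\texon\t100\t200\t.\t+\t.\tID=e1;Parent=tx1",
    "chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tID=e2;Parent=tx2",
    "chr1\tensembl\texon\t400\t500\t.\t+\t.\tID=e3;Parent=tx1",
]
factorized = factorize_exons(example)
for line in factorized:
    print(line)
```

Whether a downstream tool accepts an exon with several parents still has to be checked case by case.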
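Before running anything, it can help to see how many transcripts in the merged GTF really are exact duplicates at the exon level. The following Python sketch is a different, simpler check than what AGAT does: it ignores CDS and UTR features and only groups transcripts whose exon coordinates, sequence and strand are all identical. The function names and the example transcripts (t1, t2, t3) are hypothetical.

```python
import re
from collections import defaultdict

# GTF attributes look like: gene_id "g1"; transcript_id "t1";
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)"')


def exon_chains(gtf_lines):
    """Map (seqid, strand, transcript_id) to the list of its exon coordinates."""
    chains = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(ATTRIBUTE_RE.findall(cols[8]))
        key = (cols[0], cols[6], attrs["transcript_id"])
        chains[key].append((int(cols[3]), int(cols[4])))
    return chains


def duplicate_transcripts(gtf_lines):
    """Return groups of transcript IDs that share exactly the same exon chain."""
    groups = defaultdict(list)
    for (seqid, strand, transcript_id), exons in exon_chains(gtf_lines).items():
        groups[(seqid, strand, tuple(sorted(exons)))].append(transcript_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical example: t1 and t2 are identical, t3 has a longer last exon.
example_gtf = [
    'chr1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tensembl\texon\t400\t500\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
    'chr1\tCufflinks\texon\t400\t500\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t3";',
    'chr1\tCufflinks\texon\t400\t600\t.\t+\t.\tgene_id "g1"; transcript_id "t3";',
]
duplicates = duplicate_transcripts(example_gtf)
print(duplicates)
```

Each group lists transcripts that could be reduced to a single copy; transcripts that only partly overlap, like t3 here, are not reported.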