Merging 2 GTF files using AGAT while avoiding overlapping features
3.6 years ago
Rafael Soler ★ 1.3k

I have finished building my GTF file, and now I am trying to use it in an RNA-seq analysis to annotate the reads. The problem is that I merged 2 GTF files obtained through different methods (one from Ensembl's reference genome annotation, and one created with Cufflinks from another RNA-seq experiment). After merging, the gene is present only once, but the exons (and also the transcripts) are duplicated, so I end up with 2 isoforms with the same coordinates.


In this case, we have 2 annotations created with different methods, so if the exons have the same coordinates, they are probably the same exon.

I want to use the GTF for RNA-seq read counting. If two exons overlap, HTSeq marks the read as ambiguous and does not count it as a valid read (note all the ambiguous counts):

__no_feature    454236
__ambiguous 2600391
__too_low_aQual 0
__not_aligned   335530
__alignment_not_unique  224705

3614862 not aligned

49608 aligned

Is there a tool that I can use to remove all overlapping transcripts?

Thank you!

Merge GTF GFF agat • 2.3k views

Please don't post screenshots of data. People can't decipher them nor use information in them to copy for diagnostic purposes/tests.


Sorry, I wanted to highlight these 4 numbers.


Tagging: Juke34


AGAT cannot help here. As explained in the other thread, it removes identical transcripts only if all level3 features (exon, CDS, UTR, etc.) are identical between the transcripts. What might help is to factorize the annotation. I mean instead of

GL897338.1  ensembl    transcript  69095  70039  .  -  .  ID=ENSMPUT00000019709;Parent=ENSMPUG00000019557;
GL897338.1  ensembl    exon        69095  70039  .  -  .  ID=ENSMPUE00000195156;Parent=ENSMPUT00000019709;
GL897338.1  ensembl    CDS         69095  70039  .  -  0  ID=cds-188326;Parent=ENSMPUT00000019709;
GL897338.1  Cufflinks  transcript  69095  70039  .  -  .  ID=TCONS_00071361;Parent=ENSMPUG00000019557
GL897338.1  Cufflinks  exon        69095  70039  .  -  .  ID=exon-585762;Parent=TCONS_00071361

having

GL897338.1  ensembl    transcript  69095  70039  .  -  .  ID=ENSMPUT00000019709;Parent=ENSMPUG00000019557;
GL897338.1  Cufflinks  transcript  69095  70039  .  -  .  ID=TCONS_00071361;Parent=ENSMPUG00000019557
GL897338.1  ensembl    exon        69095  70039  .  -  .  ID=ENSMPUE00000195156;Parent=ENSMPUT00000019709,TCONS_00071361;
GL897338.1  ensembl    CDS         69095  70039  .  -  0  ID=cds-188326;Parent=ENSMPUT00000019709;

This is allowed in the GFF format (not in GTF). That way each feature is present only once in the file, and the Parent attribute shows that it is used by several transcripts. However, not all tools can deal with that, and I don't know of any tool that does this. I wanted to implement it in AGAT but never had the time.
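To make the factorization idea concrete, here is a minimal Python sketch. It is not an AGAT feature and not an existing tool; the function names `parse_attributes` and `factorize` and the `LEVEL3_TYPES` set are made up for illustration. It assumes tab-separated GFF3 input, and it treats two level3 features as the same when seqid, type, start, end and strand match (phase and source are ignored, which may be too permissive for real data).

```python
from collections import OrderedDict

# Assumption: only these level3 feature types are collapsed.
LEVEL3_TYPES = {"exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


def parse_attributes(field):
    """Parse a GFF3 attribute column (key=value;key=value) into an ordered dict."""
    attrs = OrderedDict()
    for item in field.strip().strip(";").split(";"):
        key, sep, value = item.partition("=")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs


def factorize(lines):
    """Collapse level3 features with identical coordinates into one record
    whose Parent attribute lists every transcript that uses it."""
    merged = OrderedDict()
    for index, line in enumerate(lines):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        if cols[2] in LEVEL3_TYPES:
            # seqid, type, start, end, strand define "the same feature" here.
            key = (cols[0], cols[2], cols[3], cols[4], cols[6])
        else:
            # Genes and transcripts are never collapsed.
            key = ("unique", index)
        if key not in merged:
            merged[key] = (cols[:8], attrs, parents)
        else:
            known = merged[key][2]
            for parent in parents:
                if parent not in known:
                    known.append(parent)
    out = []
    for cols, attrs, parents in merged.values():
        if parents:
            attrs["Parent"] = ",".join(parents)
        attr_field = ";".join(f"{k}={v}" for k, v in attrs.items())
        out.append("\t".join(cols + [attr_field]))
    return out


# The five records from the example above, rebuilt with real tabs.
example = [
    "\t".join(["GL897338.1", "ensembl", "transcript", "69095", "70039", ".", "-", ".",
               "ID=ENSMPUT00000019709;Parent=ENSMPUG00000019557;"]),
    "\t".join(["GL897338.1", "ensembl", "exon", "69095", "70039", ".", "-", ".",
               "ID=ENSMPUE00000195156;Parent=ENSMPUT00000019709;"]),
    "\t".join(["GL897338.1", "ensembl", "CDS", "69095", "70039", ".", "-", "0",
               "ID=cds-188326;Parent=ENSMPUT00000019709;"]),
    "\t".join(["GL897338.1", "Cufflinks", "transcript", "69095", "70039", ".", "-", ".",
               "ID=TCONS_00071361;Parent=ENSMPUG00000019557"]),
    "\t".join(["GL897338.1", "Cufflinks", "exon", "69095", "70039", ".", "-", ".",
               "ID=exon-585762;Parent=TCONS_00071361"]),
]

result = factorize(example)
for record in result:
    print(record)
```

The first record seen for a given set of coordinates is kept (here the ensembl exon), and later duplicates only contribute their Parent IDs. Records come out in order of first appearance, so a sort step may be needed before passing the file to other tools.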


A partial solution that might clean up part of your problem would be to predict the CDS within the Cufflinks exons (e.g. using TransDecoder), and then run agat_convert_sp_gxf2gxf.pl with the merge_loci option. If the CDS and exons are then the same between the transcripts, the duplicates will be removed.
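To see why the CDS prediction step matters, here is a small Python sketch that groups transcripts by the full set of their level3 features. It only illustrates the "all level3 features identical" criterion and is not how AGAT is implemented. The function names, the `LEVEL3_TYPES` set and the `cds-hypothetical` ID are assumptions made for the example, and tab-separated GFF3 input is assumed.

```python
from collections import defaultdict

# Assumption: feature types considered level3 in this sketch.
LEVEL3_TYPES = {"exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


def transcript_signatures(lines):
    """Map each transcript ID to the set of its level3 features."""
    signatures = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[2] not in LEVEL3_TYPES:
            continue
        attrs = dict(
            item.split("=", 1)
            for item in cols[8].strip().strip(";").split(";")
            if "=" in item
        )
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                signatures[parent].add(
                    (cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6])
                )
    return signatures


def identical_groups(lines):
    """Return groups of transcript IDs whose level3 feature sets are identical."""
    groups = defaultdict(list)
    for transcript_id, signature in transcript_signatures(lines).items():
        groups[frozenset(signature)].append(transcript_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def gff_line(source, ftype, phase, attributes):
    """Build one record on the locus used in this thread."""
    return "\t".join(["GL897338.1", source, ftype, "69095", "70039", ".", "-",
                      phase, attributes])


merged_annotation = [
    gff_line("ensembl", "exon", ".", "ID=ENSMPUE00000195156;Parent=ENSMPUT00000019709;"),
    gff_line("ensembl", "CDS", "0", "ID=cds-188326;Parent=ENSMPUT00000019709;"),
    gff_line("Cufflinks", "exon", ".", "ID=exon-585762;Parent=TCONS_00071361"),
]

# Without a CDS on the Cufflinks transcript, the two feature sets differ.
before = identical_groups(merged_annotation)

# Hypothetical CDS, as a predictor such as TransDecoder might add it.
with_cds = merged_annotation + [
    gff_line("transdecoder", "CDS", "0", "ID=cds-hypothetical;Parent=TCONS_00071361"),
]
after = identical_groups(with_cds)

print(before)
print(after)
```

In this toy case the two transcripts only fall into the same group once the Cufflinks transcript also carries a CDS with matching coordinates, which is the situation the partial solution above tries to create.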
