How can I easily remove overlapping transcripts, keeping only longest transcript, in a GFF file.
2
0
Entering edit mode
6.2 years ago
a.rex ▴ 350

I have recently annotated a genome using mikado. However, the resultant gff file contains loci that are overlapping but not annotated as isoforms: i.e.

chr17   Mikado_loci     ncRNA_gene      1014098 1014976 17      -       .       ID=Fly1;Name=Fly1;multiexonic=True;superlocus=Mikado_superlocus:chr17-:1014059-1015028
chr17   Mikado_loci     ncRNA   1014098 1014976 17      -       .       ID=Fly1.1;Parent=Fly1;Name=all_MSTRG.6472.4;alias=all_MSTRG.6472.4;canonical_junctions=1,2,3;canonical_number=3;canonical_proportion=1.0;p$
chr17   Mikado_loci     exon    1014098 1014296 .       -       .       ID=Fly1.1.exon1;Parent=Fly1.1
chr17   Mikado_loci     exon    1014343 1014675 .       -       .       ID=Fly1.1.exon2;Parent=Fly1.1
chr17   Mikado_loci     exon    1014720 1014871 .       -       .       ID=Fly1.1.exon3;Parent=Fly1.1
chr17   Mikado_loci     exon    1014919 1014976 .       -       .       ID=Fly1.1.exon4;Parent=Fly1.1
chr17   Mikado_loci     ncRNA_gene      1014165 1015048 17      +       .       ID=Fly2;Name=Fly2;multiexonic=True;superlocus=Mikado_superlocus:chr17+:1014165-1015048
chr17   Mikado_loci     ncRNA   1014165 1015048 17      +       .       ID=Fly2.1;Parent=Fly2;Name=aaa_MSTRG.5952.1;alias=aaa_MSTRG.5952.1;canonical_number=0;canonical_on_reverse_strand=True;canonical_proportio$
chr17   Mikado_loci     exon    1014165 1014296 .       +       .       ID=Fly2.1.exon1;Parent=Fly2.1
chr17   Mikado_loci     exon    1014343 1015048 .       +       .       ID=Fly2.1.exon2;Parent=Fly2.1

You see that they are both nested, but annotated apart. How can I resolve this, keeping only the longest transcript? If I keep both, this will play with my protein predictions - giving me a result of two proteins instead of 1.

gff parse • 9.6k views
ADD COMMENT
0
Entering edit mode

Expected output would help better understand the issue

ADD REPLY
6
Entering edit mode
5.0 years ago
Juke34 8.9k

From AGAT you can use:

agat_convert_sp_gxf2gxf.pl --gff myFile.gff --merge_loci -o myFile_lociMerged.gff
/!\ since v1.0.0 to activate merge_loci you must activate the option in the config yaml file

To merge the overlapping loci (same strand same type e.g an mRNA will not be merged with an overlapping tRNA).
Then to keep only the longest isoforms:

agat_sp_keep_longest_isoform.pl --gff myFile_lociMerged.gff -o myFile_lociMerged_longestIsoform.gff

Then if you wish to extract the protein sequences:

agat_sp_extract_sequences.pl --gff myFile_lociMerged_longestIsoform.gff -f genome.fa -p -o myFile_lociMerged_longestIsoform.fa

ADD COMMENT
0
Entering edit mode

Juke34 Is there a way to remove these nested overlapping genes using AGAT?

enter image description here

ADD REPLY
1
Entering edit mode

You want to remove the G604 gene?

ADD REPLY
0
Entering edit mode

Juke34 Yes, both G604 and g_109671, to keep only one gene (G603) per gene loci.

ADD REPLY
1
Entering edit mode

If they are not removed by the method explained above, they do not overlap in their CDS regions. For G604 it is obvious, I would not remove that gene. This type of nested gene is common, why do you want to remove to, these are two different genes?
For g_109671 it sounds like a different case, if they are kept by AGAT (still following the protocol above) it means either they it is not the same strand as G603, or they are not of the same type, e.g. ncRNA and mRNA, or transcript and mRNA. In the second case this is term issue, you should be sure to use an homogeneous way to call the same feature (3rd column).

ADD REPLY
0
Entering edit mode

Juke34 Got it on nested genes. Thanks for the clarification. Regarding CDS regions, does AGAT only collapse overlapping genes that have same number of CDS regions. For example, in case of "G604.1" and "anno1.jg73177.t1", they appears to be differ by only one CDS. (This appears to be due to term issue (e.g. gene, or transcripts) so wondering if there's a way to fix it using AGAT?). note-these are protein-coding transcripts.

ADD REPLY
1
Entering edit mode

--merge_lociwill merge 2 genes in one if a part of the CDS overlap and are in the same direction and level2 feature is the same feature type (as rule: level1>level2>level3).
So in the following case if CDS overlap and the level2 feature type is the same (i.e mRNA) the two following genes: gene1 > mrna1 > [exon1,cds1]
and
gene2 > mrna2 > [exon2,cds2]
will become:
gene1 > [mrna1, mrna2] > [ [exon1,cds1], [exon2,cds2] ]

"G604.1" and "anno1.jg73177.t1" are two isoforms of the same gene, there is nothing to merge with the --merge_loci option... but using agat_sp_keep_longest_isoform.pl if both have CDS only the longest will be kept

ADD REPLY
0
Entering edit mode

Juke34 Thanks again. I tried using agat_keep_longest_isoforms.pl, to keep longest isoform, but it doesn't seems to work despite both having CDS. An example of .gtf coordinate of overlapping isoforms is shown below;

Super-Scaffold_1    MAKER   gene    2902921 2909355 .   -   .   ID=STRG.23;Name=ORF type:complete len:152 
Super-Scaffold_1    MAKER   mRNA    2902921 2909355 .   -   .   ID=STRG.23.2.p1;Parent=STRG.23;Name=ORF type:complete len:152
Super-Scaffold_1    MAKER   exon    2902921 2903359 .   -   .   ID=STRG.23.2.p1.exon4;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   exon    2907239 2907364 .   -   .   ID=STRG.23.2.p1.exon3;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   exon    2908968 2909161 .   -   .   ID=STRG.23.2.p1.exon2;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   exon    2909251 2909355 .   -   .   ID=STRG.23.2.p1.exon1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   CDS 2903179 2903359 .   -   1   ID=cds.STRG.23.2.p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   CDS 2907239 2907364 .   -   1   ID=cds.STRG.23.2.p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   CDS 2908968 2909161 .   -   0   ID=cds.STRG.23.2.p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   CDS 2909251 2909298 .   -   0   ID=cds.STRG.23.2.p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   five_prime_UTR  2909299 2909355 .   -   .   ID=STRG.23.2.p1.utr5p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   three_prime_UTR 2902921 2903178 .   -   .   ID=STRG.23.2.p1.utr3p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   gene    2903179 2909298 1   -   .   ID=g_384490
Super-Scaffold_1    MAKER   transcript  2903179 2909298 1   -   .   ID=anno2.g32766.t2;Parent=g_384490
Super-Scaffold_1    MAKER   exon    2903179 2903359 .   -   .   ID=exon-47;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   exon    2907239 2907364 .   -   .   ID=exon-48;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   exon    2908968 2909161 .   -   .   ID=exon-49;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   exon    2909251 2909298 .   -   .   ID=exon-50;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   CDS 2903179 2903359 1   -   1   ID=cds-59;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   CDS 2907239 2907364 1   -   1   ID=cds-60;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   CDS 2908968 2909161 1   -   0   ID=cds-61;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   CDS 2909251 2909298 1   -   0   ID=cds-62;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   intron  2903360 2907238 1   -   .   ID=intron-38;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   intron  2907365 2908967 1   -   .   ID=intron-39;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   intron  2909162 2909250 1   -   .   ID=intron-40;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   start_codon 2909296 2909298 .   -   0   ID=start_codon-18;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   stop_codon  2903179 2903181 .   -   0   ID=stop_codon-23;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
ADD REPLY
1
Entering edit mode

AGAT takes in consideration the level2 feature that must be the same type. Here you have one called transcript and the other one mRNA. Change all transcript into mRNA or the other way around.

ADD REPLY
0
Entering edit mode

Juke34 Thanks again. I changed the name of transcript to mRNA to be consistent, but it is still not able to remove the isoform. Please see the example of an output from the same coordinates;

Super-Scaffold_1    MAKER   gene    2902921 2909355 .   -   .   ID=STRG.23;Name=ORF type:complete len:152 
Super-Scaffold_1    MAKER   mRNA    2902921 2909355 .   -   .   ID=STRG.23.2.p1;Parent=STRG.23;Name=ORF type:complete len:152
Super-Scaffold_1    MAKER   exon    2902921 2903359 .   -   .   ID=STRG.23.2.p1.exon4;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   exon    2907239 2907364 .   -   .   ID=STRG.23.2.p1.exon3;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   exon    2908968 2909161 .   -   .   ID=STRG.23.2.p1.exon2;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   exon    2909251 2909355 .   -   .   ID=STRG.23.2.p1.exon1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   CDS 2903179 2903359 .   -   1   ID=cds.STRG.23.2.p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   CDS 2907239 2907364 .   -   1   ID=cds.STRG.23.2.p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   CDS 2908968 2909161 .   -   0   ID=cds.STRG.23.2.p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   CDS 2909251 2909298 .   -   0   ID=cds.STRG.23.2.p1;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   five_prime_UTR  2909299 2909355 .   -   .   ID=nbis-five_prime_utr-8;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   three_prime_UTR 2902921 2903178 .   -   .   ID=nbis-three_prime_utr-41;Parent=STRG.23.2.p1
Super-Scaffold_1    MAKER   gene    2903179 2909298 1   -   .   ID=g_384490
Super-Scaffold_1    MAKER   mRNA    2903179 2909298 1   -   .   ID=anno2.g32766.t2;Parent=g_384490
Super-Scaffold_1    MAKER   exon    2903179 2903359 .   -   .   ID=exon-47;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   exon    2907239 2907364 .   -   .   ID=exon-48;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   exon    2908968 2909161 .   -   .   ID=exon-49;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   exon    2909251 2909298 .   -   .   ID=exon-50;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   CDS 2903179 2903359 1   -   1   ID=cds-59;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   CDS 2907239 2907364 1   -   1   ID=cds-60;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   CDS 2908968 2909161 1   -   0   ID=cds-61;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
Super-Scaffold_1    MAKER   CDS 2909251 2909298 1   -   0   ID=cds-62;Parent=anno2.g32766.t2;gene_id=g_384490;transcript_id=anno2.g32766.t2
ADD REPLY
1
Entering edit mode

hmm What version of AGAT are you using? In version >=1.0.0 you need to activate the merge_loci option within the config.yaml file (agat config --expose to make it appears)

ADD REPLY
0
Entering edit mode

its version 1.0.0. and installed using conda. How do I configure config.yaml file in such case?

ADD REPLY
1
Entering edit mode

Change merge_loci to true

ADD REPLY
0
Entering edit mode

Juke34 Thanks again. I changed the config.yaml should overlapping loci (at CDS level) be merged in a single locus Only one gene is kept, and the mRNA features become isoforms. merge_loci: true

However, it is still not recognizing the megre_loci option. See the error msg below.

$ agat_convert_sp_gxf2gxf.pl --gff input.gff --merge_loci -o ouput.gff

Using config.yaml file found in your working directory.

Unknown option: merge_loci

Failed to parse command line

ADD REPLY
1
Entering edit mode

Don't use --merge_loci in your command line this was for AGAT version <1.0.0 (and only few scripts), now this is handled via the config.yaml and will apply this for almost any script you will use.

ADD REPLY
0
Entering edit mode

Thank you Juke34 . It worked. However, In few cases where gene loci contain two opposite sense (+ and -) mRNAs/genes, its keeping both. Do you have any suggestion for keeping only one mRNA/gene in such cases?

ADD REPLY
0
Entering edit mode

Interesting, why would you merge genes that are on different strands? They should be different genes. It cannot be done without modifying the code and it does not sound an easy task…

ADD REPLY
2
Entering edit mode
6.2 years ago
Malcolm.Cook ★ 1.5k

For starters, ncRNA probably stands for non-coding RNA so you should expect zero proteins from that sample of GFF

But even if there were coding... what would be your problem? Presumably you chose mikado for its predictions.... why do you want to throw any out? (FWIW: I don't know mikado other than as a comic opera)

Sigh, assuming you have your reason... (and you won't be alone)

If you program R, you could:

  • use the GenomicFeatures library, load your GFF into a transcript database, txdb
  • query txdb for all transcripts
  • use findOverlaps to build a table representing the relationship of overlapping
  • compute the Transitive closure of this relationship (table) using an implementation of Floyd–Warshall algorithm - I think there is one in igraph and RGBL librarys. This will give you the overlapping groups of transcripts.
  • Take the longest in each group.

I would advise you to plan ahead a little further before taking this route. What is your end game? Why exactly do you want to find the "longest".

ADD COMMENT
0
Entering edit mode

The example I gave above is for ncRNA, but that same is observed for genes. My problem is that as they are annotated as separate loci, extracting sequences and running Transdecoder to find proteins will result in 2 potential genes/proteins instead of 1.

ADD REPLY
1
Entering edit mode

Well, perhaps you want to include a real example of your problematic case.

In any case, why would you want to discard one of the predictions when from your two predictions of overlapping genes on opposite strands?

I just first time now looked at Mikado which I now understand to be a highly configurable tool to select "best transcripts" from among many alternate versions produced by sequencing/prediction/assembly etc. By my reading, if mikado outputs overlapping predictions on opposite strands, then, that is its considered opinion, based on your configuration, of a reasonable interpretation of the input data.

But wait... looking more closely at your results, I think we are ignoring something that mikado has already highlighted. Look at the annotation in column 9 : "canonical_on_reverse_strand". It is already telling you that this transcript is "suspicious". I'm guessing you are not using mikado to its fullest extent, or are sharing one of its intermediate results with us, but that in fact there is a way to get mikado to remove that from what you take further in your analysis, or at least for you to filter further yourself based on that annotation.

If you provide true example of the problem from a coding gene we might be able to help or comment further. As it stands, I think you have a problem of not understanding your tool fully, and/or presenting misleading examples of your problem.

ADD REPLY
0
Entering edit mode

If there is a non R implementation that you are aware of, that would be helpful. With GenomicFeatures, the strandedness seems to be important, and fails if there is a '.' as opposed to a +/-

ADD REPLY
1
Entering edit mode

I have Perl approach (two scripts to run) if you are interested in. But you need to have bioperl and few Perl modules.

ADD REPLY
0
Entering edit mode

That would be very useful, thanks

ADD REPLY
1
Entering edit mode

Install AGAT with conda (conda install -c bioconda agat)

Merge overlapping loci:
agat_sp_gxf_to_gff3.pl --gff input.gff --merge_loci -o output.gff

Keep longest isoform per locus:
agat_sp_keep_longest_isoform.pl.

ADD REPLY
0
Entering edit mode

Thank you - I have started it, hoping the output resolves my issue. Should it work for my case where IDs are different (i.e. not isoforms) and delete the shorter sequence?

ADD REPLY
0
Entering edit mode

Yes. before to select the longest isoform per locus, (if you have set the $overlapCheck to 1), all the features are checked for overlaps and gathered into a same locus when there is an overlap of 2 features of the same type. By the same type I mean as example an overlap between an mRNA and a tRNA will not be considered.

ADD REPLY
0
Entering edit mode

even if they are on different strands?

ADD REPLY
0
Entering edit mode

No they have to be on the same strand too.

ADD REPLY
0
Entering edit mode

No they have to be on the same strand too.

ADD REPLY
0
Entering edit mode

GenomicFeatures is fully configurable as to treatment of strand. Please advise in what way it "fails". I have found otherwise.

ADD REPLY

Login before adding your answer.

Traffic: 1941 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6