How to filter and keep longest isoform per gene in gff3 file ?
2
1
Entering edit mode
5 days ago
Sony ▴ 20

Hi everyone,

I am doing genome annotation for Oryza sativa by following the MAKER annotation pipeline. In the configuration file of maker pipeline, I turn on "alternative splicing". here is my final gff3 file (combined homology-based cDNA EST evidence and Ab initio prediction: snap, Augustus):

head maker_annotation05.gff
##gff-version 3
Oj_ERR3890993_253       maker   gene    39      122     .       -       .       ID=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1;Name=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1
Oj_ERR3890993_253       maker   mRNA    39      122     .       -       .       ID=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1;Parent=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1;Name=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1;_AED=0.28;_eAED=0.28;_QI=0|-1|0|1|-1|0|1|0|28;score=27.91839
Oj_ERR3890993_253       maker   exon    39      122     .       -       .       ID=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1:1;Parent=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1
Oj_ERR3890993_253       maker   CDS     39      122     .       -       0       ID=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1:cds;Parent=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1
Oj_ERR3890993_253       maker   gene    316     1137    .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1;Name=maker-Oj_ERR3890993_253-augustus-gene-0.1
Oj_ERR3890993_253       maker   mRNA    316     1137    .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1;Parent=maker-Oj_ERR3890993_253-augustus-gene-0.1;Name=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|0|0|0.33|0|0|3|0|182
Oj_ERR3890993_253       maker   exon    316     594     .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1:1;Parent=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1
Oj_ERR3890993_253       maker   exon    790     970     .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1:2;Parent=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1
Oj_ERR3890993_253       maker   exon    1049    1137    .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1:3;Parent=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1

cat ../maker_annotation05.gff | grep -v "^#" | awk '{print $3}' | sort | uniq -c | sort -nr 394975 CDS 243395 exon 83406 mRNA 48897 three_prime_UTR 41785 five_prime_UTR 36786 gene

I want to keep only the longest isoform for each gene in my gff3 file . I have tried with some script from chat_GPT, but the results are inconsistent: after filtering, the number of gene is not equal to the number of mRNA. Does any one know how to do that ? Many Thanks,

gff isoform maker annotation • 251 views
ADD COMMENT
1
Entering edit mode
5 days ago

Perhaps this can help:

Usage example:

$ wget -qO- "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz" | gunzip -c > gencode.v19.annotation.gff3
$ ./longest_isoform_from_gff3.py gencode.v19.annotation.gff3 | sort -k1,1 -k4,4n -k5,5n > gencode.v19.annotation.longestIsoform.gff3
ADD COMMENT
1
Entering edit mode
5 days ago
Juke34 8.9k

If you talk about longest in term of isoform with the longest CDS a common way is to use AGAT

agat_sp_keep_longest_isoform.pl -gff file.gff  -o outfile 

Here is the doc: https://nbisweden.github.io/AGAT/tools/agat_sp_keep_longest_isoform/

ADD COMMENT
0
Entering edit mode

Many thanks for your help, It's work with my data

ADD REPLY

Login before adding your answer.

Traffic: 1835 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6