How to filter and keep longest isoform per gene in gff3 file ?
2
1
Entering edit mode
3 months ago
Sony ▴ 20

Hi everyone,

I am doing genome annotation for Oryza sativa by following the MAKER annotation pipeline. In the configuration file of maker pipeline, I turn on "alternative splicing". here is my final gff3 file (combined homology-based cDNA EST evidence and Ab initio prediction: snap, Augustus):

head maker_annotation05.gff
##gff-version 3
Oj_ERR3890993_253       maker   gene    39      122     .       -       .       ID=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1;Name=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1
Oj_ERR3890993_253       maker   mRNA    39      122     .       -       .       ID=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1;Parent=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1;Name=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1;_AED=0.28;_eAED=0.28;_QI=0|-1|0|1|-1|0|1|0|28;score=27.91839
Oj_ERR3890993_253       maker   exon    39      122     .       -       .       ID=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1:1;Parent=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1
Oj_ERR3890993_253       maker   CDS     39      122     .       -       0       ID=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1:cds;Parent=maker-Oj_ERR3890993_253-exonerate_protein2genome-gene-0.1-mRNA-1
Oj_ERR3890993_253       maker   gene    316     1137    .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1;Name=maker-Oj_ERR3890993_253-augustus-gene-0.1
Oj_ERR3890993_253       maker   mRNA    316     1137    .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1;Parent=maker-Oj_ERR3890993_253-augustus-gene-0.1;Name=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|0|0|0.33|0|0|3|0|182
Oj_ERR3890993_253       maker   exon    316     594     .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1:1;Parent=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1
Oj_ERR3890993_253       maker   exon    790     970     .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1:2;Parent=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1
Oj_ERR3890993_253       maker   exon    1049    1137    .       +       .       ID=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1:3;Parent=maker-Oj_ERR3890993_253-augustus-gene-0.1-mRNA-1

cat ../maker_annotation05.gff | grep -v "^#" | awk '{print $3}' | sort | uniq -c | sort -nr 394975 CDS 243395 exon 83406 mRNA 48897 three_prime_UTR 41785 five_prime_UTR 36786 gene

I want to keep only the longest isoform for each gene in my gff3 file . I have tried with some script from chat_GPT, but the results are inconsistent: after filtering, the number of gene is not equal to the number of mRNA. Does any one know how to do that ? Many Thanks,

gff isoform maker annotation • 537 views
ADD COMMENT
1
Entering edit mode
3 months ago

Perhaps this can help:

#!/usr/bin/env python
import sys
import pandas as pd
i_fn = sys.argv[1]
if not i_fn:
raise Exception("Error: Missing input file")
df = pd.read_csv(i_fn, delimiter='\t', header=None, comment='#')
# https://useast.ensembl.org/info/website/upload/gff3.html
df.columns = ['seqid',
'source',
'type',
'start',
'end',
'score',
'strand',
'phase',
'attributes']
transcript_ids = []
longest_isoform = {}
transcript_id_to_gene_id = {}
for index, row in df.iterrows():
attrs = {a_kv[0]:a_kv[1] for a_kv in [a_kvs.split('=') for a_kvs in row['attributes'].split(';')]}
transcript_id = attrs['transcript_id'] if 'transcript_id' in attrs else None
ensg_id = attrs['gene_id'] if 'gene_id' in attrs else None
if not transcript_id or not ensg_id: continue
if row['type'] == 'transcript':
txLength = int(row['end']) - int(row['start'])
if ensg_id not in longest_isoform:
longest_isoform[ensg_id] = {'row': row, 'txLength': txLength, 'transcript_id': transcript_id}
transcript_ids.append(transcript_id)
transcript_id_to_gene_id[transcript_id] = ensg_id
elif longest_isoform[ensg_id]['txLength'] < txLength:
transcript_id_to_filter = longest_isoform[ensg_id]['transcript_id']
idx_to_replace = transcript_ids.index(longest_isoform[ensg_id]['transcript_id'])
transcript_ids[idx_to_replace] = transcript_id
longest_isoform[ensg_id] = {'row': row, 'txLength': txLength, 'transcript_id': transcript_id}
transcript_id_to_gene_id[transcript_id] = ensg_id
for transcript_id in transcript_ids:
gene_id = transcript_id_to_gene_id[transcript_id]
for column in df.columns:
print(longest_isoform[gene_id]['row'][column], end="\t")
print()

Usage example:

$ wget -qO- "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz" | gunzip -c > gencode.v19.annotation.gff3
$ ./longest_isoform_from_gff3.py gencode.v19.annotation.gff3 | sort -k1,1 -k4,4n -k5,5n > gencode.v19.annotation.longestIsoform.gff3
ADD COMMENT
1
Entering edit mode
3 months ago
Juke34 9.1k

If you talk about longest in term of isoform with the longest CDS a common way is to use AGAT

agat_sp_keep_longest_isoform.pl -gff file.gff  -o outfile 

Here is the doc: https://nbisweden.github.io/AGAT/tools/agat_sp_keep_longest_isoform/

ADD COMMENT
0
Entering edit mode

Many thanks for your help, It's work with my data

ADD REPLY

Login before adding your answer.

Traffic: 2416 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6