Extract last exon annotation from each mRNA
4.7 years ago
kashiff007 ★ 1.9k

I am working with a GFF file and have found GFFutils to be a great tool for extracting information from it. However, even after going through the manual, I am unable to extract the coordinates of the last exon of each mRNA.

GFFUtils • 1.7k views

Which GFF are you using? What you want can likely be done outside of GFFutils, e.g., via grep -B, cut, et cetera.
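
As a rough illustration of the "outside of GFFutils" route, here is a minimal plain-Python sketch. It assumes a tab-separated GFF3 file in which every exon line carries a `Parent=` attribute naming its mRNA, and it treats the "last" exon as the 3'-most one (highest start on `+`, lowest start on `-`). The function name `last_exon_per_mrna` is made up for this example and is not part of GFFutils.

```python
from collections import defaultdict


def last_exon_per_mrna(gff_lines):
    """Return {mRNA ID: (start, end, strand)} for the 3'-most exon of each mRNA.

    Assumes tab-separated GFF3 lines with nine columns.
    """
    exons = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        # Column 9 holds key=value pairs separated by semicolons
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        # An exon may list several parents separated by commas
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((start, end, strand))

    last = {}
    for mrna, items in exons.items():
        if items[0][2] == "-":
            # Minus strand: the 3'-most exon has the lowest start
            last[mrna] = min(items, key=lambda e: e[0])
        else:
            # Plus strand: the 3'-most exon has the highest start
            last[mrna] = max(items, key=lambda e: e[0])
    return last
```

Calling it as `last_exon_per_mrna(open("annotation.gff"))` (the file name is a placeholder) would give one entry per mRNA that has at least one exon line.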

Here is a sample of the GFF I am using, Kevin.

Chr1    TAIR10  chromosome  1   30427671    .   .   .   ID=Chr1;Name=Chr1
Chr1    TAIR10  gene    3631    5899    .   +   .   ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .   +   .   ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .   +   .   ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3760    3913    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    3996    4276    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3996    4276    .   +   2   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4486    4605    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 4486    4605    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4706    5095    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 4706    5095    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    5174    5326    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 5174    5326    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    5439    5899    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 5439    5630    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  three_prime_UTR 5631    5899    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  gene    5928    8737    .   -   .   ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1    TAIR10  mRNA    5928    8737    .   -   .   ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1    TAIR10  protein 6915    8666    .   -   .   ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1    TAIR10  five_prime_UTR  8667    8737    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 8571    8666    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    8571    8737    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 8417    8464    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    8417    8464    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 8236    8325    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    8236    8325    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7942    7987    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7942    7987    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7762    7835    .   -   2   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7762    7835    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7564    7649    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7564    7649    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7384    7450    .   -   1   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7384    7450    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7157    7232    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7157    7232    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 6915    7069    .   -   2   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  three_prime_UTR 6437    6914    .   -   .   Parent=AT1G01020.1

I am fairly committed to GFFutils because I also need to perform some other tasks, which are described in this post. For instance, this GFF file contains no intron information (an intron being the space between two consecutive exons), but GFFutils can calculate that as well. It could be done with basic Unix commands, but the logic would become very complicated.
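
For reference, the intron definition used here (the gap between two consecutive exons of the same mRNA) can be sketched in a few lines of plain Python. This is only an illustration under the assumption that exon coordinates are 1-based and inclusive, as in GFF; `introns_from_exons` is a hypothetical helper, not a GFFutils function.

```python
def introns_from_exons(exons):
    """Given (start, end) exon pairs of one mRNA, return the intron (start, end) pairs.

    Coordinates are assumed to be 1-based and inclusive, as in GFF.
    """
    ordered = sorted(exons)
    introns = []
    # Walk over each pair of neighbouring exons in genomic order
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:  # skip exons that touch or overlap
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Exons of AT1G01010.1 taken from the sample above
exons = [(3631, 3913), (3996, 4276), (4486, 4605),
         (4706, 5095), (5174, 5326), (5439, 5899)]
introns = introns_from_exons(exons)
intron_lengths = [end - start + 1 for start, end in introns]
```

The result is in genomic order, so for a minus-strand mRNA the list would need to be reversed to read 5' to 3'.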

4.7 years ago
liorglic ★ 1.4k

What have you tried so far? Let's see the code...

Hi liorglic, thanks for suggesting this tool. I have tried it and am able to print the lengths of all exons from all genes. As I mentioned in my previous post, I only need the lengths of the first three and last three exons and introns.

Here is my code:

import GFFutils

G = GFFutils.GFFDB('dm3.db')
exon1_count = 0
gene_count = 0
for gene in G.features_of_type('gene'):
    gene_exon_count = 0
    print(gene)
    exons = []  # renamed from `next`, which shadows the Python builtin
    # get all grandchildren (level 2), keeping only the exons
    for child in G.children(gene.id, 2):
        if child.featuretype == 'exon':
            exons.append(child)

            # print the length of this exon
            print(len(child))
#            if len(exons) == 3:
#                break
#            gene_exon_count += 1

I tried using break and continue to stop at the third exon, but I am unable to get the last three.

How about something like this?

import GFFutils

G = GFFutils.GFFDB('dm3.db')
for gene in G.features_of_type('gene'):
    exons = list(G.children(gene, featuretype='exon'))
    first_3_exons = exons[:3]
    last_3_exons = exons[-3:]
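
One caveat: the slices above follow whatever order the exons come back in, and this thread does not confirm that this is transcript (5' to 3') order. A hedged sketch of making the order explicit, using plain (start, end) tuples instead of GFFutils features; `first_and_last_exons` is a made-up helper name, and it assumes n is at least 1:

```python
def first_and_last_exons(exons, strand, n=3):
    """Return (first_n, last_n) exons in 5'->3' order.

    exons  : list of (start, end) tuples for one transcript
    strand : '+' or '-'; on '-' the 5' end is at the highest coordinate
    n      : number of exons to take from each end (assumed >= 1)
    """
    ordered = sorted(exons, reverse=(strand == "-"))
    return ordered[:n], ordered[-n:]
```

For transcripts with fewer than 2 * n exons the two slices overlap, so the same exon can appear in both lists.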

This works fine once the database name is changed from db.children to G.children. Thanks!

Sorry - my bad. Fixed.
