Question

Gff Parsing In Biopython Is Not So Perfect

1

Entering edit mode

13.2 years ago

Gahoo ▴ 270

I use BCBio.GFF to parse chr01.gff3 and all.gff3. But things didn't work out as I expect. Here's the code:

from BCBio import GFF
limits = dict(gff_type = ["gene","mRNA","CDS"])
gff_handle = open('chr01.gff3')
for rec in GFF.parse(gff_handle,target_lines=1000,limit_info=limits):
    #Chromosome seq level
    for gene_feature in rec.features:
        #gene level
        for mRNA_feature in gene_feature.sub_features:
            #mRNA level
            print mRNA_feature.type
            print mRNA_feature.qualifiers['Alias']

And I got:

Traceback (most recent call last):
  File "R:\Untitled 1.py", line 14, in <module>
    print mRNA_feature.qualifiers['Alias']
KeyError: 'Alias'

And the 'type' is "CDS" which is not correct. When parsing without

target_lines=1000

everything is ok. But parsing all.gff3 came to the same problem. Maybe all.gff3 is too huge to parse.

The problem might be due to the parser did not recognise the entry boudary correctly.

Any suggestions or alternative packages? I still want to do this in python.

python gff biopython • 12k views

ADD COMMENT • link updated 13.2 years ago by Brad Chapman 9.7k • written 13.2 years ago by Gahoo ▴ 270

0

Entering edit mode

Did you ask about this on the biopython mailing list?

ADD REPLY • link 13.2 years ago by Scott Cain ▴ 770

0

Entering edit mode

Nope. I didn't join the mailing list.

ADD REPLY • link 13.2 years ago by Gahoo ▴ 270

score 5 · Answer 1 · 2011-10-28

Unfortunately this GFF file misuses the ### directive so the target_lines parameter will not be a reliable way to break the file keeping entire features together. Since GFF is a line oriented format with references between lines handled by ID/Parent relationships, it tries to intelligently break lines when entire features are together.

The ### directive in the file tells a parser that it has completed all nested features, so it would be safe to pass along finished objects. The GFF3 specification has more details. This GFF3 file issues those directives at the end of each transcript, which is a problem for multi-transcript genes. For instance:

Chr1    MSU_osa1r6      gene    15292   19323   .       +       . ID=13101.t00004;Name=expressed%20protein;Alias=LOC_Os01g01040
Chr1    MSU_osa1r6      mRNA    15321   19323   .       +       . ID=13101.m00008;Parent=13101.t00004;Alias=LOC_Os01g01040.2
###
Chr1    MSU_osa1r6      mRNA    15321   19323   .       +       . ID=13101.m00007;Parent=13101.t00004;Alias=LOC_Os01g01040.3
###

The second mRNA references the first, but the ### is telling the parser it is okay to make a break here.

This why removing target_lines fixes the issue. You are seeing errors with all.gff3 because some of the mRNA features do not have Alias keywords. For instance:

ChrSy   MSU_osa1r6      mRNA    488     1166    .       -       . ID=13114.m00090;Parent=13114.t00002

If parsing the entire 'all.gff3` file uses too much memory a safe way to break is by each chromosome. This is slower because it needs to parse the file multiple times, but will guarantee correctly reconstructed gene/mRNA/CDS features:

import sys
from BCBio import GFF

in_file = sys.argv[1]

examiner = GFF.GFFExaminer()
with open(in_file) as gff_handle:
    possible_limits = examiner.available_limits(gff_handle)
chromosomes = sorted(possible_limits["gff_id"].keys())
print chromosomes
limits = dict(gff_type = ["gene","mRNA","CDS"])
for chrom in chromosomes:
    with open(in_file) as gff_handle:
        limits["gff_id"] = chrom
        for rec in GFF.parse(gff_handle, limit_info=limits):
            #Chromosome seq level
            for gene_feature in rec.features:
                #gene level
                for mRNA_feature in gene_feature.sub_features:
                    #mRNA level
                    print mRNA_feature.type
                    try:
                        print mRNA_feature.qualifiers['Alias']
                    except:
                        print gene_feature
                        raise

Ram · Answer 2 · 2011-10-28

1

Entering edit mode

13.2 years ago

Giovanni M Dall'Olio 28k

Can you please report this error to Biopython's bug tracker on redmine?

ADD COMMENT • link 13.2 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Error reported.

https://redmine.open-bio.org/issues/3311

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.2 years ago by Gahoo ▴ 270