Gff Parsing In Biopython Is Not So Perfect
2
1
Entering edit mode
13.1 years ago
Gahoo ▴ 270

I use BCBio.GFF to parse chr01.gff3 and all.gff3. But things didn't work out as I expect. Here's the code:

from BCBio import GFF
limits = dict(gff_type = ["gene","mRNA","CDS"])
gff_handle = open('chr01.gff3')
for rec in GFF.parse(gff_handle,target_lines=1000,limit_info=limits):
    #Chromosome seq level
    for gene_feature in rec.features:
        #gene level
        for mRNA_feature in gene_feature.sub_features:
            #mRNA level
            print mRNA_feature.type
            print mRNA_feature.qualifiers['Alias']

And I got:

Traceback (most recent call last):
  File "R:\Untitled 1.py", line 14, in <module>
    print mRNA_feature.qualifiers['Alias']
KeyError: 'Alias'

And the 'type' is "CDS" which is not correct. When parsing without

target_lines=1000

everything is ok. But parsing all.gff3 came to the same problem. Maybe all.gff3 is too huge to parse.

The problem might be due to the parser did not recognise the entry boudary correctly.

Any suggestions or alternative packages? I still want to do this in python.

python gff biopython • 12k views
ADD COMMENT
0
Entering edit mode

Did you ask about this on the biopython mailing list?

ADD REPLY
0
Entering edit mode

Nope. I didn't join the mailing list.

ADD REPLY
5
Entering edit mode
13.1 years ago

Unfortunately this GFF file misuses the ### directive so the target_lines parameter will not be a reliable way to break the file keeping entire features together. Since GFF is a line oriented format with references between lines handled by ID/Parent relationships, it tries to intelligently break lines when entire features are together.

The ### directive in the file tells a parser that it has completed all nested features, so it would be safe to pass along finished objects. The GFF3 specification has more details. This GFF3 file issues those directives at the end of each transcript, which is a problem for multi-transcript genes. For instance:

Chr1    MSU_osa1r6      gene    15292   19323   .       +       . ID=13101.t00004;Name=expressed%20protein;Alias=LOC_Os01g01040
Chr1    MSU_osa1r6      mRNA    15321   19323   .       +       . ID=13101.m00008;Parent=13101.t00004;Alias=LOC_Os01g01040.2
###
Chr1    MSU_osa1r6      mRNA    15321   19323   .       +       . ID=13101.m00007;Parent=13101.t00004;Alias=LOC_Os01g01040.3
###

The second mRNA references the first, but the ### is telling the parser it is okay to make a break here.

This why removing target_lines fixes the issue. You are seeing errors with all.gff3 because some of the mRNA features do not have Alias keywords. For instance:

ChrSy   MSU_osa1r6      mRNA    488     1166    .       -       . ID=13114.m00090;Parent=13114.t00002

If parsing the entire 'all.gff3` file uses too much memory a safe way to break is by each chromosome. This is slower because it needs to parse the file multiple times, but will guarantee correctly reconstructed gene/mRNA/CDS features:

import sys
from BCBio import GFF

in_file = sys.argv[1]

examiner = GFF.GFFExaminer()
with open(in_file) as gff_handle:
    possible_limits = examiner.available_limits(gff_handle)
chromosomes = sorted(possible_limits["gff_id"].keys())
print chromosomes
limits = dict(gff_type = ["gene","mRNA","CDS"])
for chrom in chromosomes:
    with open(in_file) as gff_handle:
        limits["gff_id"] = chrom
        for rec in GFF.parse(gff_handle, limit_info=limits):
            #Chromosome seq level
            for gene_feature in rec.features:
                #gene level
                for mRNA_feature in gene_feature.sub_features:
                    #mRNA level
                    print mRNA_feature.type
                    try:
                        print mRNA_feature.qualifiers['Alias']
                    except:
                        print gene_feature
                        raise
ADD COMMENT
2
Entering edit mode

Or to put it another way, GIGO. Did you try validating the GFF3?

ADD REPLY
0
Entering edit mode

Thanks for your answer. ### is the problem. But the reason paring all.gff goes wrong is not because of the mRNA features without 'Alias'. The Exception raise before the parser gets to ChrSy. On my computer, 'LOC_Os09g40085.1' is the last correct one. The problem is similar with the one when pasring chr01.gff. But this time it might cause by the memory limits that make it parse the file incompletely.

ADD REPLY
0
Entering edit mode

Can you do a try/except on the qualifiers['Alias'] call and print out the problem mRNA_feature, like in the code example? When I do that the ChrSy line is the problem. The parser will not change behavior based on memory usage; instead you'd slow down and start swapping or get a python memory error.

ADD REPLY
1
Entering edit mode
13.1 years ago

Can you please report this error to Biopython's bug tracker on redmine?

ADD COMMENT
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 2035 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6