Unfortunately this GFF file misuses the ###
directive so the
target_lines
parameter will not be a reliable way to break the file
keeping entire features together. Since GFF is a line oriented format
with references between lines handled by ID/Parent relationships, it
tries to intelligently break lines when entire features are together.
The ###
directive in the file tells a parser that it has completed
all nested features, so it would be safe to pass along finished
objects. The GFF3 specification has more details. This GFF3 file
issues those directives at the end of each transcript, which is a
problem for multi-transcript genes. For instance:
Chr1 MSU_osa1r6 gene 15292 19323 . + . ID=13101.t00004;Name=expressed%20protein;Alias=LOC_Os01g01040
Chr1 MSU_osa1r6 mRNA 15321 19323 . + . ID=13101.m00008;Parent=13101.t00004;Alias=LOC_Os01g01040.2
###
Chr1 MSU_osa1r6 mRNA 15321 19323 . + . ID=13101.m00007;Parent=13101.t00004;Alias=LOC_Os01g01040.3
###
The second mRNA references the first, but the ###
is telling the
parser it is okay to make a break here.
This why removing target_lines
fixes the issue. You are seeing
errors with all.gff3
because some of the mRNA features do not
have Alias keywords. For instance:
ChrSy MSU_osa1r6 mRNA 488 1166 . - . ID=13114.m00090;Parent=13114.t00002
If parsing the entire 'all.gff3` file uses too much memory a safe way
to break is by each chromosome. This is slower because it needs to
parse the file multiple times, but will guarantee correctly
reconstructed gene/mRNA/CDS features:
import sys
from BCBio import GFF
in_file = sys.argv[1]
examiner = GFF.GFFExaminer()
with open(in_file) as gff_handle:
possible_limits = examiner.available_limits(gff_handle)
chromosomes = sorted(possible_limits["gff_id"].keys())
print chromosomes
limits = dict(gff_type = ["gene","mRNA","CDS"])
for chrom in chromosomes:
with open(in_file) as gff_handle:
limits["gff_id"] = chrom
for rec in GFF.parse(gff_handle, limit_info=limits):
#Chromosome seq level
for gene_feature in rec.features:
#gene level
for mRNA_feature in gene_feature.sub_features:
#mRNA level
print mRNA_feature.type
try:
print mRNA_feature.qualifiers['Alias']
except:
print gene_feature
raise
Did you ask about this on the biopython mailing list?
Nope. I didn't join the mailing list.