I am wondering about the best practice in Biopython to work with large GFF3 files (>100 MB). As a minimum requirement, I want to be able to quickly and randomly retrieve features by ID, name, and coordinates, and the returned feature objects (SeqRecords?) should have the hierarchical feature structure (e.g. gene -> mRNA -> CDS) correctly mapped onto them. Write access would be nice but is optional. External dependencies should be kept minimal to increase portability to other systems.
I am currently thinking along the lines of importing the GFF3 file into a BioSQL database (How? Can I use a SQLite database?) and then accessing this database with Biopython's BioSQL interface. GFFutils looks useful, but I would prefer to work with Biopython objects (e.g. Bio.SeqRecord).
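For the retrieval side, I imagine something roughly like this (just a sketch; "gff_db" and the file names are placeholders, and I have not verified how the gene -> mRNA -> CDS hierarchy would come back on the record's features):

```python
# Untested sketch of the retrieval I have in mind via Biopython's BioSQL
# interface; assumes a SQLite BioSQL database already populated from GFF3.
from BioSQL import BioSeqDatabase

server = BioSeqDatabase.open_database(driver="sqlite3", db="biosql.sqlite")
db = server["gff_db"]                 # placeholder namespace name

record = db.lookup(name="chrI")       # lookup works per sequence record
for feature in record.features:
    print(feature.type, feature.location)
```

One limitation I already see: lookup is by record (landmark) name, so retrieval by individual feature ID or by coordinates would still need something on top of this.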
I know how to do all this in BioPerl (bp_seqfeature_load.pl and Bio::DB::SeqFeature::Store), but I am new to Biopython.
Edit: I changed the title a bit to better reflect my requirements.
From its table schema, GFFUtils does not seem to use an R-tree index or binning, so it probably won't achieve the speed you need. BioSQL supports the SQLite backend and gets the schema part right. I guess that by BioPerl's SeqFeature::Store being slow you mean the parsing speed is slow (as always), not the retrieval speed? UCSC uses similar techniques to host terabytes of data. If I were to do this task, I would write the whole thing myself. It might also be a good opportunity to learn something new if you do not yet know how BioSQL/UCSC achieve fast regional queries. As to the interface to Biopython, you can generate GFF lines from the database and let Biopython parse them; you naturally get Bio.SeqRecord objects.
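To give an idea, here is a rough sketch of the standard UCSC binning scheme (the function names and the table layout implied by the query are just for illustration):

```python
# Standard UCSC binning: bins cover 128 kb, 1 Mb, 8 Mb, 64 Mb and 512 Mb
# windows; features longer than 512 Mb need the extended scheme (not shown).

def reg2bin(start, end):
    """Smallest bin that fully contains the 0-based half-open interval."""
    start_bin, end_bin = start >> 17, (end - 1) >> 17
    for offset in (585, 73, 9, 1, 0):   # 128 kb ... 512 Mb levels
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= 3
        end_bin >>= 3
    raise ValueError("interval too large for the standard binning scheme")

def reg2bins(start, end):
    """All bins that may hold features overlapping [start, end)."""
    bins, shift = [], 17
    for offset in (585, 73, 9, 1, 0):
        bins.extend(range(offset + (start >> shift),
                          offset + ((end - 1) >> shift) + 1))
        shift += 3
    return bins
```

A regional query then becomes something like `SELECT * FROM features WHERE seqid=? AND bin IN (...) AND start < ? AND end > ?`, which only touches a handful of index buckets instead of scanning every row.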
gffutils author here. OP's link is actually to a fork of an old version -- see https://github.com/daler/gffutils instead.
@lh3, adding UCSC binning should be trivial, though allowing backwards compatibility with existing gffutils databases will be a little awkward. But I should look into BioSQL's schema to help improve that of gffutils.
@Christian, a Bio.SeqRecord adapter should also be straightforward to add and would be generally useful -- I'll add it to the todo list.
For a large file with millions of lines, a spatial index such as an R-tree or UCSC binning is a must-have. As to backward compatibility, you can append a "bin" column to the old version of the table when it is absent. SQLite also supports an R-tree index, but it is not compiled in by default, so it is probably better not to rely on that feature. As to BioSQL, I said it gets this right only because it uses binning; I would rather stuff the entire GFF into one table, as GFFUtils does.
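Retrofitting the "bin" column could look roughly like this (a sketch only; the database file, table and column names are assumptions about a gffutils-style schema, not its actual layout):

```python
import sqlite3

def reg2bin(start, end):
    """Smallest standard UCSC bin fully containing the 0-based interval."""
    start_bin, end_bin = start >> 17, (end - 1) >> 17
    for offset in (585, 73, 9, 1, 0):
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= 3
        end_bin >>= 3
    raise ValueError("interval too large for the standard binning scheme")

conn = sqlite3.connect("annotations.db")   # hypothetical gffutils-style database
cur = conn.cursor()

# Add the column only when it is missing, for backward compatibility.
columns = [row[1] for row in cur.execute("PRAGMA table_info(features)")]
if "bin" not in columns:
    cur.execute("ALTER TABLE features ADD COLUMN bin INTEGER")

# Fill it from the existing coordinates (GFF start/end are 1-based inclusive).
for feature_id, start, end in cur.execute(
        "SELECT id, start, end FROM features").fetchall():
    cur.execute("UPDATE features SET bin = ? WHERE id = ?",
                (reg2bin(start - 1, end), feature_id))

cur.execute("CREATE INDEX IF NOT EXISTS ix_features_bin ON features (seqid, bin)")
conn.commit()
```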
Yes, by BioPerl's SeqFeature::Store being slow I meant the parsing part. But as I said, this only becomes an issue when you programmatically retrieve thousands of features from the database, which is why SeqFeature::Store's performance is generally sufficient for genome browsers, where you usually look at small genomic regions containing only a few features at a time.
Generating Bio.SeqRecord objects from BioSQL via a GFF intermediate could work, but is likely not very efficient because of the parsing overhead it entails. I am wondering whether serialization/deserialization would be faster, i.e. one could store a serialized version of each Bio.SeqRecord object in the database and just deserialize it upon retrieval.
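Something along these lines is what I am picturing (just a sketch; the table layout and file name are made up for illustration):

```python
# Store pickled SeqRecord objects in SQLite and deserialize on retrieval.
import pickle
import sqlite3
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

conn = sqlite3.connect("records.db")   # hypothetical cache database
conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, obj BLOB)")

def store(record):
    # Serialize the whole SeqRecord, nested features included.
    blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)", (record.id, blob))

def fetch(record_id):
    row = conn.execute("SELECT obj FROM records WHERE id = ?",
                       (record_id,)).fetchone()
    return pickle.loads(row[0]) if row else None

store(SeqRecord(Seq("ATGC"), id="demo"))
print(fetch("demo"))
conn.commit()
```

The obvious downside is that this only gives retrieval by key per record; for coordinate queries within a record I would still need a spatial index of some kind on top of it.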
If I go with the BioSQL solution, I would have to import the GFF into BioSQL first. What would be the best/fastest way to load a GFF file into a BioSQL-schema database? So far the only tool I am aware of is Brad Chapman's gff_to_biosql.py script, available on GitHub.
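For reference, this is roughly what I imagine that loading step would look like with bcbio-gff and Biopython's BioSQL SQLite driver (untested sketch; "gff_db" and the file names are placeholders, and it assumes the BioSQL SQLite schema has already been applied to biosql.sqlite):

```python
from BCBio import GFF
from BioSQL import BioSeqDatabase

server = BioSeqDatabase.open_database(driver="sqlite3", db="biosql.sqlite")
db = server.new_database("gff_db")   # placeholder namespace name

with open("annotations.gff3") as in_handle:
    # GFF.parse yields one SeqRecord per landmark (chromosome/contig),
    # with child features nested under their parent features.
    count = db.load(GFF.parse(in_handle))

server.commit()
print("Loaded %i records" % count)
```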