Hi guys,
I've been exploring gffutils lately. All is well when I work with a small GFF file (less than a Mb in size). However, I'm trying to make a database file from the Mus_musculus.GRCm38.81.gff(gtf) files and it takes a very long time. The GTF file is around a Gb in size. I left it running and it was still going after a day (about 25 hours), so I killed it with Ctrl+C. Right now I'm trying to make the database file from the GFF file instead, and it's been running for a few hours so far. The GFF is around 300 Mb in size.
Here is the command I'm running:
db = gffutils.create_db(myGFF, dbfn='Mus_musculus_GFF.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True)
Am I doing something wrong, or can it really take days to make a database file?
By the way, PC specs are: i7-4600U CPU @ 2.10GHz with 16 Gb of RAM.
Also, can I multi-thread this, and would it help?
Thanks,
Kirill
It really shouldn't take longer than an hour. For example, database creation for the full human GENCODE GTF takes 20 mins.
Can you post a link to the file(s) you're using?
@Ryan you can pull the exact files from here: http://bioinformatics.erc.monash.edu/home/kirill/check/Mus_musculus.GRCm38.81.gff3
I actually got those files from Ensembl ftp://ftp.ensembl.org/pub/release-81/gtf/mus_musculus/
Thanks
P.S. Those links are temporary. I'll remove them at some point.
@Ryan just a quick update. Database file generation has been running for about 5-6 hours now and hasn't finished yet. The current database file is ~150 Mb. I think the database file is roughly double the size of the original GFF (that was the case with the previous GFFs I tried). If so, and the original GFF is 250 Mb, I'd expect a db file of around 500 Mb, so there's still a long way to go. I'll wait until tomorrow morning and might kill it then. Either way, if a GTF/GFF file is a Gb in size, at this rate it'll take too long to make the db file.
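For what it's worth, the back-of-the-envelope estimate above can be written out as a small Python sketch. It is only a rough guide and rests on several assumptions: that the db file grows linearly with time (it may not, for example if indexes are built at the end), that the final db is about twice the GFF size as observed on the smaller files, and that "5-6 hours" is taken as 5.5. The function name `estimate_remaining_hours` is made up for illustration and is not part of gffutils.

```python
def estimate_remaining_hours(db_mb_now, hours_elapsed, gff_mb, db_to_gff_ratio=2.0):
    """Linearly extrapolate the remaining build time from the current db size.

    Assumes the db file grows at a constant rate and ends up at
    gff_mb * db_to_gff_ratio (a ratio observed on smaller files, not a rule).
    """
    expected_db_mb = gff_mb * db_to_gff_ratio
    rate_mb_per_hour = db_mb_now / hours_elapsed
    remaining_mb = max(expected_db_mb - db_mb_now, 0.0)
    return expected_db_mb, remaining_mb / rate_mb_per_hour


# Numbers from this post: ~150 Mb db after ~5.5 hours, 250 Mb GFF.
expected_mb, hours_left = estimate_remaining_hours(150, 5.5, 250)
print(f"expected db size: ~{expected_mb:.0f} Mb, remaining: ~{hours_left:.1f} h")
```

Under those assumptions the run would need roughly another 13 hours, which is far from the "under an hour" figure mentioned above.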
Thanks