GFFutils very slow at creating database file. Any Idea why..?

0

Entering edit mode

10.1 years ago

Kirill Tsyganov ▴ 370

Hi guys,

I've been exploring gffutils lately. All is well when I work with small gff file (less then a Mb in size). However I'm trying to make a database file with Mus_musculus.GRCm38.81.gff(gtf) files and it takes very long time.. The gtf file is around Gb in size. I left it running and it was still running after a day (like 25 hours). I ctrl^c it. Right now trying to make database file using GFF file instead and its been running for few hours thus far. GFF is around 300 Mb in size.

Here is the command I'm running

db = gffutils.create_db(myGFF, dbfn='Mus_musculus_GFF.db', force=True, keep_order=True, merge_strategy='merge',     sort_attribute_values=True)

Am I doing something wrong or it can take days to make a database file?

by the way PC specs are: i7-4600U CPU @ 2.10GHz with 16 Gb of RAM.

Also can I multi thread and will it help..?

Thanks,
Kirill

gffutils gff • 8.1k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 10.1 years ago by Kirill Tsyganov ▴ 370

0

Entering edit mode

It really shouldn't take longer than an hour. For example, database creation for the full human GENCODE GTF takes 20 mins.

Can you post a link to the file(s) you're using?

ADD REPLY • link 10.1 years ago by Ryan Dale 5.0k

0

Entering edit mode

@Ryan you can pull exact files from here http://bioinformatics.erc.monash.edu/home/kirill/check/Mus_musculus.GRCm38.81.gff3

http://bioinformatics.erc.monash.edu/home/kirill/check/Mus_musculus.GRCm38.81.gff3

I actually got those files from Ensembl ftp://ftp.ensembl.org/pub/release-81/gtf/mus_musculus/

Thanks

P.S Those links are temporal. I'll remove at some point

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 10.1 years ago by Kirill Tsyganov ▴ 370

0

Entering edit mode

@Ryan just a quick update. So database file generation have been running for about 5-6 hours now and hasn't finished yet. The current database file is ~ 150 Mb. I think database file roughly is the double of the original gff. (It was the case with previous gff's I tried). If this the case then original gff is 250 Mb I then expect db file aroung 500 Mb. Still long way to go.. I'll way until tomorrow morning and might kill it then. Either way if gtf/gff file is Gb in size at this rate it'll be too long to make db file.

Thanks

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.1 years ago by Kirill Tsyganov ▴ 370

2

Entering edit mode

10.1 years ago

Ryan Dale 5.0k

Using the script below, called as

python make-gtf-db.py Mus_musculus.GRCm38.81.gtf.gz GRCm38.81.db

it took 12 minutes to create a database out of the 1.5M features in the GTF file.

Turns out Ensembl's GTF files don't follow the GTF spec, so gffutils needs some hints to avoid spending a ton of time merging features due to name collisions. I've commented the script pretty heavily to illustrate my reasoning for the options I used.

	"""
	https://www.biostars.org/p/152517/

	Example of how to work with Ensembl release 81 GTF files, which:

	1) already have genes and transcripts included

	2) have unique IDs for genes, transcripts, and exons in the corresponding
	"<featuretype>_id" attribute

	3) do not have unique IDs for CDS, stop_codon, start_codon, UTR.

	See background info at on database IDs at:
	https://pythonhosted.org/gffutils/database-ids.html


	GTF file from

	ftp://ftp.ensembl.org/pub/release-81/gtf/mus_musculus/Mus_musculus.GRCm38.81.gtf.gz

	"""

	import gffutils
	import time
	import argparse

	ap = argparse.ArgumentParser()
	ap.add_argument('input', help='Ensembl Release 81 GTF file')
	ap.add_argument('database', help='Database to be created')
	ap.add_argument('--testN', type=int,
	help='Run a test using only the first N features, and then '
	'print out some example feature IDs and their attributes')
	ap.add_argument('--force', action='store_true',
	help='Overwrite an existing database')
	args = ap.parse_args()


	def first_n_features(data, n=5000):
	"""
	Useful for testing: only use the first `n` features of source data
	"""
	for i, feature in enumerate(gffutils.iterators.DataIterator(data)):
	if i > n:
	break
	yield feature


	# Note: this function is optional; if you don't want these IDs then comment out
	# the lines at [1] below
	def subfeature_handler(f):
	"""
	Given a gffutils.Feature object (which does not yet have its ID assigned),
	figure out what its ID should be.

	This is intended to be used for CDS, UTR, start_codon, and stop_codon
	features in the Ensembl release 81 GTF files. I figured a reasonable
	unique ID would consist of the parent transcript and the feature type,
	followed by an autoincrementing number.

	See https://pythonhosted.org/gffutils/database-ids.html#id-spec for
	details and other options.
	"""
	return ''.join(
	['autoincrement:',
	f.attributes['transcript_id'][0],
	'_',
	f.featuretype])


	# gffutils can spend a lot of time trying to decide on a unique ID for each
	# feature. So we have to give it hints of where to look in the attributes.
	#
	# We also tell it to use our subfeature_handler function for featuretypes with
	# no unique IDs.
	id_spec = {
	'exon': 'exon_id',
	'gene': 'gene_id',
	'transcript': 'transcript_id',

	# [1] These aren't needed for speed, but they do give nicer IDs.
	'CDS': [subfeature_handler],
	'stop_codon': [subfeature_handler],
	'start_codon': [subfeature_handler],
	'UTR': [subfeature_handler],
	}

	if args.testN is None:
	data = args.input
	else:
	data = first_n_features(args.input, args.testN)

	t0 = time.time()
	db = gffutils.create_db(
	data,
	args.database,

	# Since Ensembl GTF files now come with genes and transcripts already in
	# the file, we don't want to spend the time to infer them (which we would
	# need to do in an on-spec GTF file)
	disable_infer_genes=True,
	disable_infer_transcripts=True,

	# Here's where we provide our custom id spec
	id_spec=id_spec,

	# "create_unique" runs a lot faster than "merge"
	# See https://pythonhosted.org/gffutils/database-ids.html#merge-strategy
	# for details.
	merge_strategy='create_unique',
	verbose=True,
	force=args.force,
	)
	t1 = time.time()

	if args.testN is not None:
	for ex in ['CDS', 'start_codon', 'stop_codon', 'UTR']:
	c = db.features_of_type(ex).next()
	print '-----------------------------'
	print 'Example "{0}" feature; ID="{1}"'.format(ex, c.id)
	print 'Attributes:\n{0}'.format(c.attributes)

	print "\n\n{0:.1f}s to create database".format(t1 - t0)

view raw make-gtf-db.py hosted with ❤ by GitHub

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 10.1 years ago by Ryan Dale 5.0k

0

Entering edit mode

@Ryan, thanks for code and effort, I really do appreciate it, but firstly I really don't think GTF format is the problem..I think Ensembl does follow GFT file standards. Secondly though what are the standards ..? We sort of had this discussion previously here A: Ensemble GTF format: isn't the tag "transcript_id" mandatory?. I've been doing a bit of research and here are the most common sites that people refer to as GFF/GTF specifications..

From what I understand the only major difference between GFF 1,2,3 and GTF is the 9th column formatting. The tag and value pairs. It worth to mention that features type can also differ (but I don't think its that important at all. I actually never have seen start and stop _codon features). According to Sanger site tag names can be *Tags must be standard identifiers ([A-Za-z][A-Za-z0-9_]*). *My understanding that means any text, at least for GFF3 format... Whereas according to GENCODE there are pre-defined number of tag that you must have, that's for GTF only format.

I've noticed that Ensembl had only released GFF3 format in the latest release (that is release 81). I looked back a few releases and they have only ever released GTF files until now.

The reason I'm writing it all out here is because GFF/GTF files are pain in the back work with, but unavoidable.. You did mention once before about writing to GA4GH in order to standardise the format. I'm all for it.

I git cloned your code snippet, but it doesn't work. Here is the command I'm running

python make-gtf-db.py ~/ref-files/Mammalian/Mus_musculus/Mus_musculus.GRCm38.80.gtf t.db

Here is the error I'm getting

Traceback (most recent call last):
  File "make-gtf-db.py", line 111, in <module>
    force=args.force,
TypeError: create_db() got an unexpected keyword argument 'disable_infer_genes'

I don't know if I can do that, but if i comment out lines 100 and 101 (because both cause the error) the script runs for 10-15 minutes fine (I actually see t.db file grows in size very quickly to 1.4 Gb, but then it seats at this forever

2015-07-31 08:58:05,082 - INFO - Committing changes: 1524000 features  
2015-07-31 08:58:10,510 - INFO - Populating features table and first-order relations: 1524099 features  
2015-07-31 08:58:10,510 - INFO - Creating relations(parent) index  
2015-07-31 08:58:41,345 - INFO - Creating relations(child) index  
2015-07-31 08:59:12,226 - INFO - Inferring gene and transcript extents, and writing to tempfile  
2015-07-31 09:01:49,873 - INFO - Importing inferred features into db  
0 of 433220 (0%)

Ultimate goal for me is to have one reliable annotation format that I can easily parse and get what I need. Right now I need to parse GFF such that I need to grab all features that belong to a gene and kinder do some quick calculations and merging of different exons.. My script works ok, but I'd much rather utilise gffutils tool. I really like database idea central to gffutils..

Here are a couple of some what related link for others reference on the GFF/GTF matter:

Yet another randomly formatted GFF file A: RNASeqQC - Low number of genes detected on run
Good discussion about large GFF files and how to get quick feature access Light-Weight Script-Based Random Access To Large Gff3 Files Using Biopython

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.1 years ago by Kirill Tsyganov ▴ 370

0

Entering edit mode

The disable_infer_genes and disable_infer_transcripts are new features in the latest version (0.8.4rc1) to accommodate GTF files that only had gene or transcript -- but not both -- already included See https://github.com/daler/gffutils/issues/48 for details.

Updating should fix the issue. If you don't want to update your version of gffutils, replace disable_infer_genes=True, disable_infer_transcripts=True with infer_gene_extent=False in my example to get the equivalent behavior for your file.

As for the larger point about GTF and GFF:

It's true that the syntax of GTF and GFF only differ by the 9th field. But semantically they can be quite different. The biggest difference is that GFF doesn't guarantee that gene ID is specified in each exon (requiring traversing the exon-transcript-gene DAG to figure it out); GTF doesn't guarantee that gene and transcript features are included in the file. Actual contents in actual GTF or GFF files vary wildly, and supporting all of those various flavors is what gffutils is for.

The specs you link to are actually a mix of GTF and GFF. At the time gffutils was written, people had converged on the second of your links to be the GTF spec, which does not include genes and transcripts. One of gffutils' original jobs was to infer the full extent of transcripts and genes based on such files that only provided exons.

Recently, providers like Ensembl have started to include gene and transcript features. That's convenient, but since it's not the canonical GTF format, gffutils' assumptions are wrong.

The extent to which Ensembl becomes the de facto GTF format is the extent to which gffutils should change the assumptions about GTF format. I could be convinced to change the assumptions.

However, my goal for gffutils is to maintain the flexibility to handle any sort of semantics you can cram into a GTF or GFF file. For example, when creating a database we need unique primary keys. But the Ensembl GTF files don't provide unique IDs for exons. gffutils was trying to use the "gene_id" as the exon ID, but since so many exons share the same gene ID it was spending a lot of time reconciling those features. The fix in this example script was to provide hints to create a unique exon ID, drastically speeding things up.

If you'd like more guidance ot have suggestions, I'd encourage you to submit an issue in the github repo, https://github.com/daler/gffutils. It might be better suited for discussion than the comments of a biostar post!

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 10.1 years ago by Ryan Dale 5.0k

0

Entering edit mode

@Ryan thanks, the latest version works much better. I'm sure you know, but you don't get the latest version through pip install.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 10.1 years ago by Kirill Tsyganov ▴ 370

0

Entering edit mode

Yep, the latest version has an "rc1" at the end of the version which won't be automatically installed by pip. But it's been like that for a while, thanks for reminding me to upload a new final release.

I've just uploaded to PyPI v0.8.5 (no changes from 0.8.4rc1); you should be able to pip install upgrade now.

ADD REPLY • link 10.0 years ago by Ryan Dale 5.0k

0

Entering edit mode

10.1 years ago

Matt Shirley 10k

For what it's worth, several people have reported a similar experience. Maybe you want to add to the issue? https://github.com/daler/gffutils/issues/20

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 10.1 years ago by Matt Shirley 10k

0

Entering edit mode

great, I'll consider this.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.1 years ago by Kirill Tsyganov ▴ 370

Login before adding your answer.