9.3 years ago
espop23
I have data from GENCODE that looks like this:
chr1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1 ENSEMBL gene 30366 30503 . + . gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
chr1 ENSEMBL gene 157784 157887 . - . gene_id "ENSG00000222623.1"; gene_type "snRNA"; gene_status "KNOWN"; gene_name "RNU6-1100P"; level 3;
I have tried using gffutils, but I get an error with this code:
import gffutils
db = gffutils.create_db("sRNA.gene.gtf", dbfn='sRNA.gene.gtf.db')
print(list(db.featuretypes()))
# ['CDS', 'exon', 'gene', 'start_codon', 'stop_codon', 'transcript']
# Here's how to write genes out to file
with open('sRNA.gene.gtf', 'w') as fout:
    for gene in db.features_of_type('gene'):
        fout.write(str(gene) + '\n')
It fails with:
ImportError: cannot import name 'feature'
Can someone please offer suggestions on the best way to parse such GTF files?
If I use your example GTF file and your example code, it works -- with the exception that the list of featuretypes is ['gene'] since only gene features are in your example GTF.
Can you provide a minimal example (complete code and input) that reproduces the error?
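While putting together a minimal example, one thing that may be worth ruling out is that Python is importing something other than the installed gffutils package, for instance a local script that happens to be named gffutils.py. This is only a guess at the cause and is not confirmed anywhere in this thread. The sketch below uses the standard library to show which file an import would resolve to; the helper name locate_module is made up for illustration.

```python
import importlib.util


def locate_module(name):
    """Return the file Python would load for `name`, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# If this prints a path inside your own working directory rather than
# site-packages, a local file may be shadowing the real package.
# It prints None when gffutils is not installed at all.
print(locate_module("gffutils"))
```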
More generally, what is your end goal? It may not be necessary to create a database. For example, you can use gffutils just for parsing a GTF file (with the gffutils.FeatureIterator class).
Last, see some hints at "A: GFFutils very slow at creating database file. Any Idea why..?" for using GENCODE GTF files which now already include features for genes and transcripts.
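If the goal is only to read gene records and their attributes, a database may indeed be unnecessary. As a rough illustration, here is a minimal plain-Python sketch that does not use gffutils at all. The names GtfRecord, parse_attributes and iter_gtf are hypothetical, and the sketch assumes the usual nine-column GTF layout with attributes written as key "value"; pairs, as in the lines posted in the question. It is not a full GTF implementation: repeated attribute keys, for example, keep only the last value.

```python
import re
from typing import Dict, Iterable, Iterator, NamedTuple


class GtfRecord(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


# Matches `key "value";` as well as unquoted values such as `level 3;`.
_ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_attributes(text: str) -> Dict[str, str]:
    """Turn the ninth GTF column into a dict (last value wins on repeated keys)."""
    return {key: quoted or bare for key, quoted, bare in _ATTR_RE.findall(text)}


def iter_gtf(lines: Iterable[str]) -> Iterator[GtfRecord]:
    """Yield one GtfRecord per data line, skipping blank lines and # comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            # Fall back to whitespace splitting for text whose tabs were
            # converted to spaces (as in the pasted example).
            fields = line.split(None, 8)
        if len(fields) != 9:
            raise ValueError("expected 9 columns, got: %r" % line)
        seqid, source, feature, start, end, score, strand, frame, attrs = fields
        yield GtfRecord(seqid, source, feature, int(start), int(end),
                        score, strand, frame, parse_attributes(attrs))


# The three lines from the question, used here as sample input.
SAMPLE = [
    'chr1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;',
    'chr1 ENSEMBL gene 30366 30503 . + . gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;',
    'chr1 ENSEMBL gene 157784 157887 . - . gene_id "ENSG00000222623.1"; gene_type "snRNA"; gene_status "KNOWN"; gene_name "RNU6-1100P"; level 3;',
]

for rec in iter_gtf(SAMPLE):
    if rec.feature == "gene":
        print(rec.seqid, rec.start, rec.end, rec.strand,
              rec.attributes["gene_name"], rec.attributes["gene_type"])
```

To read a real file, pass an open file handle instead of SAMPLE, for example: with open("sRNA.gene.gtf") as fh: records = list(iter_gtf(fh)).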
Hello espop23!
It appears that your post has been cross-posted to another site: https://www.reddit.com/r/bioinformatics/comments/3rvn3g/help_parsing_gtf_file/
This is typically not recommended as it runs the risk of annoying people in both communities.