Extract features from GFF file
Hi pals,
I have a genome in GFF format like this:
CP014038.1 GeneMarkS+ CDS 3717912 3718988 . - 0 "ID=cds0;Parent=gene0;Dbxref=NCB...."
CP014038.1 Genbank gene 631 2190 . - . "ID=gene1;Name=AL538_00010;gbkey=....."
Is there a way to extract the features I want (locus_tag, Name...) from the last column and make it look like this?
CP014038.1 GeneMarkS+ CDS 3717912 3718988 . - 0 "locus_tag=AL34598_3409; Name=N/A;....."
Thanks for your help
Ander
genome
sequence
gene
You could use the following GTF-processing skeleton, which parses the attributes
column into a Python dictionary.
    #!/usr/bin/env python
    import sys
    import os

    for line in sys.stdin:
        chomped_line = line.rstrip(os.linesep)
        if chomped_line.startswith('#'):
            # skip header directives and comment lines
            pass
        elif chomped_line.startswith('track'):
            # skip non-standard use of track keyword by Ensembl
            pass
        else:
            elems = chomped_line.split('\t')
            cols = dict()
            try:
                cols['seqname'] = elems[0].lstrip(' ')  # strip leading whitespace
                cols['source'] = elems[1]
                cols['feature'] = elems[2]
                cols['start'] = int(elems[3])
                cols['end'] = int(elems[4])
                cols['score'] = elems[5]
                cols['strand'] = elems[6]
                cols['frame'] = elems[7]
                cols['attributes'] = elems[8].rstrip(' ')  # strip trailing whitespace
            except IndexError:
                sys.stderr.write("[%s] - Error: Input appears to be missing GTF-specific fields (check that your input data is GTF-formatted)\n" % (sys.argv[0]))
                sys.exit(os.EX_DATAERR)
            try:
                cols['comments'] = elems[9]
            except IndexError:
                cols['comments'] = None
            # GTF attributes look like: key "value"; key "value";
            # Split each pair on the first space only, so values containing spaces
            # stay intact, and skip empty items left by the trailing semicolon.
            # Values keep their surrounding double quotes.
            # For GFF3-style key=value attributes, split on '=' instead.
            attributes = dict(item.strip().split(' ', 1) for item in cols['attributes'].split(';') if item.strip())
            # do stuff with attributes
You could filter out key-value pairs, process certain keys, or rewrite key-value pairs in some desired order, etc.
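The sample lines in the question use semicolon-separated key=value pairs (GFF3 style) rather than GTF's key "value" pairs, so the filtering and rewriting step might look roughly like the sketch below. This is only an illustration under some assumptions: the function names are made up for this example, the default keys (locus_tag, Name) and the N/A placeholder are taken from the question, the input is assumed to be tab-delimited, and the double quotes around the last column are stripped and not added back.

```python
def parse_gff3_attributes(field):
    """Parse a 'key=value;key=value' attributes column into a dict."""
    attributes = {}
    # Drop surrounding whitespace and the double quotes seen in the question's sample.
    for item in field.strip().strip('"').split(';'):
        item = item.strip()
        if not item:
            continue  # skip empty items left by a trailing semicolon
        key, _, value = item.partition('=')
        attributes[key] = value
    return attributes


def rewrite_attributes(field, wanted=("locus_tag", "Name"), missing="N/A"):
    """Keep only the wanted keys, in the given order, filling gaps with a placeholder."""
    attributes = parse_gff3_attributes(field)
    return "; ".join("%s=%s" % (key, attributes.get(key, missing)) for key in wanted)


def rewrite_line(line, wanted=("locus_tag", "Name"), missing="N/A"):
    """Rewrite column 9 of one tab-delimited line; pass other lines through unchanged."""
    stripped = line.rstrip("\n")
    elems = stripped.split("\t")
    if stripped.startswith("#") or len(elems) < 9:
        return stripped
    elems[8] = rewrite_attributes(elems[8], wanted, missing)
    return "\t".join(elems)
```

To run it over a whole file, you could loop over sys.stdin and print rewrite_line(line) for each line. Percent-encoded characters in values are left as they are.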
I forgot to close the thread when I managed to get what I was looking for. Thanks anyway, I'll try your approach next time I need to do this!