Hi to all, I have used GLIMMER 3.2 to predict genes in the bacterial genomes and I get the ORF ID and not the protein sequences. These sequences have to be used in the another tool and it should be in FASTA FORMAT.So I have used a biopython script to convert gene predictions in GFF3 format to protein sequences.I am using .predict
file of GLIMMER which I am not sure as it not exactly in GFF3 format, tried checking it using GFF3 validator and only the first few lines are in GFF3 format, GFF3 format indeed is divided into many kinds which I read many days before I am still not able to figure out which format is the GLIMMER output.The biopython script I am running gives the error. Plz help as I new to the world of bioinformatics,LINUX.The biopython script is here
#!/usr/bin/env python
"""Convert GlimmerHMM GFF3 gene predictions into protein sequences.
This works with the GlimmerHMM GFF3 output format:
##gff-version 3
##sequence-region Contig5.15 1 47390
Contig5.15 GlimmerHMM mRNA 323 325 . + . ID=Contig5.15.path1.gene1;Name=Contig5.15.path1.gene1
Contig5.15 GlimmerHMM CDS 323 325 . + 0 ID=Contig5.15.cds1.1;Parent=Contig5.15.path1.gene1;Name=Contig5.15.path1.gene1;Note=final-exon
http://www.cbcb.umd.edu/software/GlimmerHMM/
Usage:
glimmergff_to_proteins.py <glimmer gff3> <ref fasta>
"""
from __future__ import with_statement
import sys
import os
import operator
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from BCBio import GFF
def main(glimmer_file, ref_file):
with open(ref_file) as in_handle:
ref_recs = SeqIO.to_dict(SeqIO.parse(in_handle, "fasta"))
base, ext = os.path.splitext(glimmer_file)
out_file = "%s-proteins.fa" % base
with open(out_file, "w") as out_handle:
SeqIO.write(protein_recs(glimmer_file, ref_recs), out_handle, "fasta")
def protein_recs(glimmer_file, ref_recs):
"""Generate protein records from GlimmerHMM gene predictions.
"""
with open(glimmer_file) as in_handle:
for rec in glimmer_predictions(in_handle, ref_recs):
for feature in rec.features:
seq_exons = []
for cds in feature.sub_features:
seq_exons.append(rec.seq[
cds.location.nofuzzy_start:
cds.location.nofuzzy_end])
gene_seq = reduce(operator.add, seq_exons)
if feature.strand == -1:
gene_seq = gene_seq.reverse_complement()
protein_seq = gene_seq.translate()
yield SeqRecord(protein_seq, feature.qualifiers["ID"][0], "", "")
def glimmer_predictions(in_handle, ref_recs):
"""Parse Glimmer output, generating SeqRecord and SeqFeatures for predictions
"""
for rec in GFF.parse(in_handle, target_lines=1000, base_dict=ref_recs):
yield rec
if __name__ == "__main__":
if len(sys.argv) != 3:
print __doc__
sys.exit()
main(*sys.argv[1:])
The error which I get is I have tried running this script many times and I get this error most of the times
Traceback (most recent call last):
File "glimmergff_to_proteins.py", line 62, in <module>
main(*sys.argv[1:])
File "glimmergff_to_proteins.py", line 27, in main
with open(ref_file) as in_handle:
Value Error:More than one record found in handle
Every time I run the script I get different errors,The previous error which I think was due to the multiple FASTA file format for the reference file and when the split file was run it shows the same, further splitting also shows the same error.The most recent error which I got was
Traceback (most recent call last):
File "glimmergff_to_proteins.py", line 62, in <module>
main(*sys.argv[1:])
File "glimmergff_to_proteins.py", line 33, in main
SeqIO.write(protein_recs(glimmer_file, ref_recs), out_handle, "fasta")
File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 426, in write
count = writer_class(fp).write_file(sequences)
File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/Interfaces.py", line 254, in write_file
count = self.write_records(records)
File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/Interfaces.py", line 238, in write_records
for record in records:
File "glimmergff_to_proteins.py", line 39, in protein_recs
for rec in glimmer_predictions(in_handle, ref_recs):
File "glimmergff_to_proteins.py", line 55, in glimmer_predictions
for rec in GFF.parse(in_handle, target_lines=1000, base_dict=ref_recs):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 712, in parse
target_lines):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 299, in parse_in_parts
for results in self.parse_simple(gff_files, limit_info, target_lines):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 320, in parse_simple
for results in self._gff_process(gff_files, limit_info, target_lines):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 606, in _gff_process
for out in self._lines_to_out_info(line_gen, limit_info, target_lines):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 637, in _lines_to_out_info
results = self._map_fn(line, params)
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 157, in _gff_line_map
8w assert len(parts) >= 8, line
Please reply and help in making the script work, have been working on this for months but without any results. Please feel free to mail me (nope: all responses stay on site (md)). My mail id is [deleted]
Here is the link to the script so you can check the line number
Dear genomelover, it's quite bad netiquette to be shouting for help in uppercase letters, and to demand a reply ASAP...
If you wish us to try to help you, please take the time to edit your question to remove shouting, and please do edit your error messages, so that the post is easier to read.
If you also take the time to tell us what you have already tried to get rid of those errors, you will drastically improve your chances of getting a quick reply : http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002202.
Also, including line numbers in the script above would be helpful, as without them, it's difficult to trace the errors based on the current information? I don't know what line 27 and line 62 are.
i have added the link to the script which contains the line number,kindly reply soon
I see it is long time you posted this question. Since I was looking for something similar, I was trying the code you suggested. I did not get any error. The problem is the following and it is often not considered by people who thinks to be good at coding:
gene predictions can be partial. Therefore the frame of the first exon can be different from the 0th (that is the first codon can be partial). It follows that a good gff to sequence script has to take this into account e.g. by trimming the sequence of the first exon before translating it. It seems to me that the script you were using does not take this fact into account. If this is so, it will output completely wrong protein sequences in those cases.