Hello: I have been writing a code for automatic annotation of a genome. Basically, what I've done is to use the DNA sequence of the contig, then predict the ORFs with prodigal, then use this info for BLAST analysis and thus obtain the putative function of each predicted protein. The final goal is to have *.gbk file readable in Artemis.
The code and problems:
The variable record
is supposed to contain both the DNA sequence and all predicted function of the genes, obtained from BLAST analysis (the loop). As I said all info is stored in the same variable record
, like this record = (SeqIO.parse(dna_file,"fasta"),SeqFeature(FeatureLocation(a2, b, strand = c), type = "CDS", qualifiers=quali))
(see the code below).
When I try to save this info as a GenBank
file (final part of the code) an error is shown, also, I think that the variable record
is being overwritten inside the loop which is also a problem.
I don't know what is my mistake at this point, I'm stuck. Please, help!!! - any advice will help
Thanks a lot
###Import GFF reader
from BCBio.GFF import GFFExaminer
from BCBio import GFF
###Import SeqIO for parsing fasta results
from Bio import SeqIO
###Import Seq feature
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
###Import BLAST module and XML parser
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
###Reading *.fasta file containing the parent sequence
dna_file = "/Users/k/prodigal/640.fasta"
record = list(SeqIO.parse(dna_file,"fasta"))
###Reading *.gff file with information of gene prediction from PRODIGAL for one contig
gff_file = "/Users/k/prodigal/my640.gff"
gff_handle = open(gff_file)
###Reading .fasta files cointaining protein prediction sequences using PRODIGAL
prot_file = "/Users/k/prodigal/protseq640.fasta"
###Recording protein sequences in a list to BLAST them one by one
prot_list = list(SeqIO.parse(prot_file,"fasta"))
###Parsing resuts from *.gff file
for rec in GFF.parse(gff_handle):
###This will tell you the number of genes in the file
entries = len(rec.features)
###This will extract the information of the gene you chose:
for x in range(0,entries):
feature = rec.features[x]
###Start location
a = feature.location.start
a2 = a+1
###End location
b = feature.location.end
###Strand
c = feature.strand
###Will retrieve the protein sequence needed
prot_in_contig = prot_list[x].seq
###BLAST on line
blast_handle = NCBIWWW.qblast("blastp", "nr", prot_in_contig)
save_file = open("/Users/k/Desktop/my_blast.xml", "w")
save_file.write(blast_handle.read())
save_file.close()
blast_handle.close()
for record in NCBIXML.parse(open("/Users/k/Desktop/my_blast.xml")):
if record.alignments :
##print record.alignments[0].hsps[0].score
blast_match = "%s with e-value of %s" % (record.alignments[0].hit_def,
record.alignments[0].hsps[0].expect)
###All additional information recorded in the section "qualifiers"
quali = {"ID":x+1,"translation":prot_in_contig,"source":"Prediction with PRODIGAL",
"blast_match":blast_match, "blast_match_ID":record.alignments[0].hit_id}
###Variable that records all features: FeatureLocation;
record = (SeqIO.parse(dna_file,"fasta"),SeqFeature(FeatureLocation(a2, b, strand = c), type = "CDS",
qualifiers=quali))
f_handle = open("/Users/k/Desktop/contig", "w")
SeqIO.write(record, f_handle, "genbank")
If you get an error message, it is almost always useful to tell us what the error message is.n