Hello everyone!
I am a beginner with bioinformatics but at the company I work at we have a genome assembly of one of our crops. I wanted to annotate the genome and to do so I used a piece of python code in ubuntu. I used the Augustus Arabidopsis database and obtained a .gff file which I can visualize in IGV agains the reference genome.
Now I want to do a functional annotation with Blast. I think I should extract the protein sequences from my .gff file and that does not seem to work.
I hope to get some advice on how to do the functional annotation as well as if I am on the right track at all by using the Augustus for annotation.
The best shot I had at trying to make it work was the code below, however, after a couple of seconds of running I got ""Error during BLAST: Error message from NCBI: Message ID#29 Error: Query string not found in the CGI context "message id#29 error query string not found in the cgi context"
Thanks in advance!!
Screenshot from the .gff file when I make it into a .gff.txt file
This is the code I used at the id#29 error:
#!/usr/bin/env python
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
def extract_protein_sequences(gff_file, genome_file):
protein_sequences = {}
with open(gff_file) as gff:
for line in gff:
if '\tCDS\t' in line:
parts = line.split('\t')
start, end, strand = int(parts[3]), int(parts[4]), parts[6]
seq_id = parts[0]
seq_type = parts[2]
attributes = parts[8].split(';')
gene_id = None
for attr in attributes:
if attr.startswith('gene_id='):
gene_id = attr.split('=')[1]
break
if gene_id is None:
continue
protein_id = f"{seq_id}_{gene_id}_protein"
if seq_type == 'CDS':
sequence = genome_file[seq_id][start-1:end]
if strand == '-':
sequence = sequence.reverse_complement()
protein_sequences[protein_id] = sequence.translate()
return protein_sequences
def run_blast(protein_sequences):
try:
result_handle = NCBIWWW.qblast("blastp", "nr", "\n".join(str(seq) for seq in protein_sequences.values()))
blast_records = NCBIXML.parse(result_handle)
return blast_records
except ValueError as ve:
print("Error during BLAST:", ve)
return []
def main(gff_file_path, genome_file_path, output_fasta_file_path):
genome_file = SeqIO.to_dict(SeqIO.parse(genome_file_path, "fasta"))
protein_sequences = extract_protein_sequences(gff_file_path, genome_file)
blast_records = run_blast(protein_sequences)
with open(output_fasta_file_path, "w") as output_file:
output_file.write("Protein_ID\tDescription\tE-value\n")
for record in blast_records:
if record.alignments:
alignment = record.alignments[0]
title = alignment.title.split(" ", 1)
description = title[1] if len(title) > 1 else ""
e_value = alignment.hsps[0].expect
output_file.write(f"{record.query}\t{description}\t{e_value}\n")
if __name__ == "__main__":
gff_file = "/home/share/Genome_Annotation.gff"
genome_file = "/home/share/Genome.fasta"
output_fasta_file = "/home/share/Predicted_proteins.fasta"
main(gff_file, genome_file, output_fasta_file)
Thanks a lot for helping biofalconch ! Do you think this will also make a correct header for the fasta file in order to be able to get nice visualisation in IGV?
I ran the code and I got this output, it is the same for every single line.
Any ideas on how to fix this?
If I recall correctly the fasta will have the
trasnscript_id
as headerI think it might have to do how the gene and transcript entries don't have transcript_id or gene_id on the attribute field, just the names. You could try this to add them:
Just note that since I don't have the actual gff I have no idea if this will work :)
Thanks a lot for the help again!
I was trying something else in a different file with this code:
This gave me 3 output files: .aa, .codingseq and .mrna. The .aa file contained all the proteinsequences with headers like >g1.t1 up untill the last one which is something like >g11763.t1
I tried running this in blast but the file was too large, so next step I will try is running a local blast.I did try it with the first 2 sequences tho. This gave me homologs but when turning it into a .BED file, the .BED file did not localize with the .gff file and referencegenomen.fasta in IGV. So I still think that there is something wrong with the header or the information that it gives to the .BED file.