Question

How to extract protein sequences from a .gff file

0

Entering edit mode

16 months ago

Nilo • 0

Hello everyone!

I am a beginner with bioinformatics but at the company I work at we have a genome assembly of one of our crops. I wanted to annotate the genome and to do so I used a piece of python code in ubuntu. I used the Augustus Arabidopsis database and obtained a .gff file which I can visualize in IGV agains the reference genome.

Now I want to do a functional annotation with Blast. I think I should extract the protein sequences from my .gff file and that does not seem to work.

I hope to get some advice on how to do the functional annotation as well as if I am on the right track at all by using the Augustus for annotation.

The best shot I had at trying to make it work was the code below, however, after a couple of seconds of running I got ""Error during BLAST: Error message from NCBI: Message ID#29 Error: Query string not found in the CGI context "message id#29 error query string not found in the cgi context"

Thanks in advance!!

Screenshot from the .gff file when I make it into a .gff.txt file

This is the code I used at the id#29 error:

#!/usr/bin/env python

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def extract_protein_sequences(gff_file, genome_file):
    protein_sequences = {}
    with open(gff_file) as gff:
        for line in gff:
            if '\tCDS\t' in line:
                parts = line.split('\t')
                start, end, strand = int(parts[3]), int(parts[4]), parts[6]
                seq_id = parts[0]
                seq_type = parts[2]

                attributes = parts[8].split(';')
                gene_id = None
                for attr in attributes:
                    if attr.startswith('gene_id='):
                        gene_id = attr.split('=')[1]
                        break

                if gene_id is None:
                    continue

                protein_id = f"{seq_id}_{gene_id}_protein"
                if seq_type == 'CDS':
                    sequence = genome_file[seq_id][start-1:end]
                    if strand == '-':
                        sequence = sequence.reverse_complement()
                    protein_sequences[protein_id] = sequence.translate()
    return protein_sequences

def run_blast(protein_sequences):
    try:
        result_handle = NCBIWWW.qblast("blastp", "nr", "\n".join(str(seq) for seq in protein_sequences.values()))
        blast_records = NCBIXML.parse(result_handle)
        return blast_records
    except ValueError as ve:
        print("Error during BLAST:", ve)
        return []


def main(gff_file_path, genome_file_path, output_fasta_file_path):
    genome_file = SeqIO.to_dict(SeqIO.parse(genome_file_path, "fasta"))
    protein_sequences = extract_protein_sequences(gff_file_path, genome_file)
    blast_records = run_blast(protein_sequences)

    with open(output_fasta_file_path, "w") as output_file:
        output_file.write("Protein_ID\tDescription\tE-value\n")
        for record in blast_records:
            if record.alignments:
                alignment = record.alignments[0]
                title = alignment.title.split(" ", 1)
                description = title[1] if len(title) > 1 else ""
                e_value = alignment.hsps[0].expect
                output_file.write(f"{record.query}\t{description}\t{e_value}\n")

if __name__ == "__main__":
    gff_file = "/home/share/Genome_Annotation.gff"
    genome_file = "/home/share/Genome.fasta"
    output_fasta_file = "/home/share/Predicted_proteins.fasta"
    main(gff_file, genome_file, output_fasta_file)

functional-annotation gff blast augustus • 1.9k views

ADD COMMENT • link updated 16 months ago by GenoMax 147k • written 16 months ago by Nilo • 0

GenoMax · Answer 1 · 2023-08-02

1

Entering edit mode

16 months ago

biofalconch ★ 1.3k

Why not use gffread for this? You can extract the protein sequence using:

gffread -g Genome.fa -y ProteinsOutput.fa YourGFF.gff

ADD COMMENT • link 16 months ago by biofalconch ★ 1.3k

0

Entering edit mode

Thanks a lot for helping biofalconch ! Do you think this will also make a correct header for the fasta file in order to be able to get nice visualisation in IGV?

I ran the code and I got this output, it is the same for every single line.

Any ideas on how to fix this?

enter image description here

ADD REPLY • link 16 months ago by Nilo • 0

0

Entering edit mode

Do you think this will also make a correct header for the fasta file in order to be able to get nice visualisation in IGV?

If I recall correctly the fasta will have the trasnscript_id as header

I ran the code and I got this output, it is the same for every single line.

I think it might have to do how the gene and transcript entries don't have transcript_id or gene_id on the attribute field, just the names. You could try this to add them:

awk '{OFS="\t"}
{ if ($3=="gene"){
    gsub("^","gene_id=",$9);
    gsub("$",";",$9);
    #print
} else if($3=="transcript"){
    geneID=$9;
    transcriptID=$9;
    gsub("..*",";",geneID);
    gsub("^","gene_id=",geneID);
    gsub("^","transcript_id=",transcriptID);
    $9=geneID" "transcriptID";";
    print $9
} else {
    print
} 
}'

Just note that since I don't have the actual gff I have no idea if this will work :)

ADD REPLY • link 16 months ago by biofalconch ★ 1.3k

0

Entering edit mode

Thanks a lot for the help again!

I was trying something else in a different file with this code:

input_genome_file="/files_path/Genome.fasta"

output_gff_file="file_path/Genome_Annotation.gff" 

./getAnnoFasta.pl --seqfile="$input_genome_file" "$output_gff_file"

This gave me 3 output files: .aa, .codingseq and .mrna. The .aa file contained all the proteinsequences with headers like >g1.t1 up untill the last one which is something like >g11763.t1

I tried running this in blast but the file was too large, so next step I will try is running a local blast.I did try it with the first 2 sequences tho. This gave me homologs but when turning it into a .BED file, the .BED file did not localize with the .gff file and referencegenomen.fasta in IGV. So I still think that there is something wrong with the header or the information that it gives to the .BED file.

ADD REPLY • link updated 16 months ago by GenoMax 147k • written 16 months ago by Nilo • 0