How to extract protein sequences from a .gff file
1
0
Entering edit mode
16 months ago
Nilo • 0

Hello everyone!

I am a beginner with bioinformatics but at the company I work at we have a genome assembly of one of our crops. I wanted to annotate the genome and to do so I used a piece of python code in ubuntu. I used the Augustus Arabidopsis database and obtained a .gff file which I can visualize in IGV agains the reference genome.

Now I want to do a functional annotation with Blast. I think I should extract the protein sequences from my .gff file and that does not seem to work.

I hope to get some advice on how to do the functional annotation as well as if I am on the right track at all by using the Augustus for annotation.

The best shot I had at trying to make it work was the code below, however, after a couple of seconds of running I got ""Error during BLAST: Error message from NCBI: Message ID#29 Error: Query string not found in the CGI context "message id#29 error query string not found in the cgi context"

Thanks in advance!!

Screenshot from the .gff file when I make it into a .gff.txt file

Screenshot from the .gff file when I make it into a .gff.txt file

This is the code I used at the id#29 error:

#!/usr/bin/env python

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def extract_protein_sequences(gff_file, genome_file):
    protein_sequences = {}
    with open(gff_file) as gff:
        for line in gff:
            if '\tCDS\t' in line:
                parts = line.split('\t')
                start, end, strand = int(parts[3]), int(parts[4]), parts[6]
                seq_id = parts[0]
                seq_type = parts[2]

                attributes = parts[8].split(';')
                gene_id = None
                for attr in attributes:
                    if attr.startswith('gene_id='):
                        gene_id = attr.split('=')[1]
                        break

                if gene_id is None:
                    continue

                protein_id = f"{seq_id}_{gene_id}_protein"
                if seq_type == 'CDS':
                    sequence = genome_file[seq_id][start-1:end]
                    if strand == '-':
                        sequence = sequence.reverse_complement()
                    protein_sequences[protein_id] = sequence.translate()
    return protein_sequences

def run_blast(protein_sequences):
    try:
        result_handle = NCBIWWW.qblast("blastp", "nr", "\n".join(str(seq) for seq in protein_sequences.values()))
        blast_records = NCBIXML.parse(result_handle)
        return blast_records
    except ValueError as ve:
        print("Error during BLAST:", ve)
        return []


def main(gff_file_path, genome_file_path, output_fasta_file_path):
    genome_file = SeqIO.to_dict(SeqIO.parse(genome_file_path, "fasta"))
    protein_sequences = extract_protein_sequences(gff_file_path, genome_file)
    blast_records = run_blast(protein_sequences)

    with open(output_fasta_file_path, "w") as output_file:
        output_file.write("Protein_ID\tDescription\tE-value\n")
        for record in blast_records:
            if record.alignments:
                alignment = record.alignments[0]
                title = alignment.title.split(" ", 1)
                description = title[1] if len(title) > 1 else ""
                e_value = alignment.hsps[0].expect
                output_file.write(f"{record.query}\t{description}\t{e_value}\n")

if __name__ == "__main__":
    gff_file = "/home/share/Genome_Annotation.gff"
    genome_file = "/home/share/Genome.fasta"
    output_fasta_file = "/home/share/Predicted_proteins.fasta"
    main(gff_file, genome_file, output_fasta_file)
functional-annotation gff blast augustus • 1.9k views
ADD COMMENT
1
Entering edit mode
16 months ago
biofalconch ★ 1.3k

Why not use gffread for this? You can extract the protein sequence using:

gffread -g Genome.fa -y ProteinsOutput.fa YourGFF.gff
ADD COMMENT
0
Entering edit mode

Thanks a lot for helping biofalconch ! Do you think this will also make a correct header for the fasta file in order to be able to get nice visualisation in IGV?

I ran the code and I got this output, it is the same for every single line.

Any ideas on how to fix this?

enter image description here

ADD REPLY
0
Entering edit mode

Do you think this will also make a correct header for the fasta file in order to be able to get nice visualisation in IGV?

If I recall correctly the fasta will have the trasnscript_id as header

I ran the code and I got this output, it is the same for every single line.

I think it might have to do how the gene and transcript entries don't have transcript_id or gene_id on the attribute field, just the names. You could try this to add them:

awk '{OFS="\t"}
{ if ($3=="gene"){
    gsub("^","gene_id=",$9);
    gsub("$",";",$9);
    #print
} else if($3=="transcript"){
    geneID=$9;
    transcriptID=$9;
    gsub("..*",";",geneID);
    gsub("^","gene_id=",geneID);
    gsub("^","transcript_id=",transcriptID);
    $9=geneID" "transcriptID";";
    print $9
} else {
    print
} 
}'

Just note that since I don't have the actual gff I have no idea if this will work :)

ADD REPLY
0
Entering edit mode

Thanks a lot for the help again!

I was trying something else in a different file with this code:

input_genome_file="/files_path/Genome.fasta"

output_gff_file="file_path/Genome_Annotation.gff" 

./getAnnoFasta.pl --seqfile="$input_genome_file" "$output_gff_file"

This gave me 3 output files: .aa, .codingseq and .mrna. The .aa file contained all the proteinsequences with headers like >g1.t1 up untill the last one which is something like >g11763.t1

I tried running this in blast but the file was too large, so next step I will try is running a local blast.I did try it with the first 2 sequences tho. This gave me homologs but when turning it into a .BED file, the .BED file did not localize with the .gff file and referencegenomen.fasta in IGV. So I still think that there is something wrong with the header or the information that it gives to the .BED file.

ADD REPLY

Login before adding your answer.

Traffic: 2098 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6