Question

Protein sequence extraction from a gpff file for a list of genes

5 weeks ago

reza ▴ 300

I have a list of gene names and a file (gpff format) containing protein sequences. I want to extract the protein sequence for each gene from the gpff file. How can I do this?

Part of the gpff file:

LOCUS       XP_031247110             372 aa            linear   PLN 22-OCT-2019
DEFINITION  GDSL esterase/lipase At4g16230-like [Pistacia vera].
ACCESSION   XP_031247110
VERSION     XP_031247110.1
DBLINK      BioProject: PRJNA578116
DBSOURCE    REFSEQ: accession XM_031391250.1
KEYWORDS    RefSeq; includes ab initio.
SOURCE      Pistacia vera
  ORGANISM  Pistacia vera
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;
            Pentapetalae; rosids; malvids; Sapindales; Anacardiaceae; Pistacia.
COMMENT     MODEL REFSEQ:  This record is predicted by automated computational
            analysis. This record is derived from a genomic sequence
            (NW_022196320.1) annotated using gene prediction method: Gnomon.
            Also see:
                Documentation of NCBI's Annotation Process

            ##Genome-Annotation-Data-START##
            Annotation Provider         :: NCBI
            Annotation Status           :: Full annotation
            Annotation Name             :: Pistacia vera Annotation Release 100
            Annotation Version          :: 100
            Annotation Pipeline         :: NCBI eukaryotic genome annotation
                                           pipeline
            Annotation Software Version :: 8.2
            Annotation Method           :: Best-placed RefSeq; Gnomon
            Features Annotated          :: Gene; mRNA; CDS; ncRNA
            ##Genome-Annotation-Data-END##

            ##RefSeq-Attributes-START##
            ab initio :: 8% of CDS bases
            ##RefSeq-Attributes-END##
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..372
                     /organism="Pistacia vera"
                     /cultivar="Batoury"
                     /db_xref="taxon:55513"
                     /chromosome="Unknown"
                     /tissue_type="leaf"
                     /country="China"
     Protein         1..372
                     /product="GDSL esterase/lipase At4g16230-like"
                     /calculated_mol_wt=40507
     CDS             1..372
                     /gene="LOC116104818"
                     /coded_by="XM_031391250.1:1..1119"
                     /db_xref="GeneID:116104818"
ORIGIN      
        1 mtekiptkfl llcfpllaif fpcnvycwst ygsqikgmfv fgsslvdngn nnflltlaka
       61 nyspygvdfp ggpsgrftng mnvidllgee lqlpslipvf ydpstkggrt ivhgvnyasg
      121 gsgilndtgs iagnvvslne qirnfdevtl pelkthvdcr stdllhnylf vvgsggndys
      181 fnyfltqana nvsveaftdn linslsqqlk klyslggrkf vlmsvnplgc npvarasqpt
      241 gqdgciqvln qaahlfnsrl rltvdfirpq mpgstlvfvn sykiitdiig dpvsngfndt
      301 rkaccqvlsv neggngilck rggrvcaern ihvffdglhp teavniqiak kafgsynrde
      361 vypinvrqla kl

Protein gpff • 293 views

updated 5 weeks ago by JC 13k • written 5 weeks ago by reza ▴ 300

score 0 · Answer 1 · 2025-02-20

This is the first time I've come across this format, but it looks like the GenBank (GBK) flat-file format adapted for proteins, so you can parse it with a little Perl:

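#!/usr/bin/perl
use strict;
use warnings;

my $in_seq = 0; # true between ORIGIN and the record terminator "//"
while (<>) {
    if (/^ORIGIN/) {
        $in_seq = 1; # sequence lines follow
    } elsif (m{^//}) {
        $in_seq = 0; # end of record
    } elsif ($in_seq) {
        s/[\s\d]//g; # remove spaces and position numbers
        print "$_\n" if length; # print seq
    } elsif (m{^\s+/gene="(.+?)"}) {
        # the /gene qualifier comes before ORIGIN, so use it as the FASTA header
        print ">$1\n";
    }
}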

Example:

$ perl gpff_parser.pl < file.gpff
>LOC116104818
mtekiptkflllcfpllaiffpcnvycwstygsqikgmfvfgsslvdngnnnflltlaka
nyspygvdfpggpsgrftngmnvidllgeelqlpslipvfydpstkggrtivhgvnyasg
gsgilndtgsiagnvvslneqirnfdevtlpelkthvdcrstdllhnylfvvgsggndys
fnyfltqananvsveaftdnlinslsqqlkklyslggrkfvlmsvnplgcnpvarasqpt
gqdgciqvlnqaahlfnsrlrltvdfirpqmpgstlvfvnsykiitdiigdpvsngfndt
rkaccqvlsvneggngilckrggrvcaernihvffdglhpteavniqiakkafgsynrde
vypinvrqlakl
$ perl gpff_parser.pl < file.gpff > sequence.fasta

I don't know whether you want other attributes captured as well, but the script is easy to adapt to what you need.
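Since the question asks for the sequences of a specific list of genes, one possible adaptation is sketched below. It is only a sketch under a few assumptions that are not in the original post: the gene list is a plain text file with one name per line, each name matches the `/gene` qualifier exactly (e.g. `LOC116104818`), and the names `extract_genes`, `gpff_by_gene.pl`, `genes.txt` and `selected.fasta` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads gpff records from filehandle $fh and returns
# FASTA text for every record whose /gene qualifier is a key of %$wanted.
sub extract_genes {
    my ($fh, $wanted) = @_;
    my ($gene, $in_seq, $seq, $out) = ('', 0, '', '');
    while (my $line = <$fh>) {
        if ($line =~ m{^//}) {                  # end of record
            $out .= ">$gene\n$seq\n" if $seq ne '' && $wanted->{$gene};
            ($gene, $in_seq, $seq) = ('', 0, '');
        } elsif ($in_seq) {                     # inside the ORIGIN block
            $line =~ s/[\s\d]//g;               # drop spaces and position numbers
            $seq .= $line;
        } elsif ($line =~ /^ORIGIN/) {
            $in_seq = 1;
        } elsif ($line =~ m{^\s+/gene="(.+?)"}) {
            $gene = $1;
        }
    }
    # flush a final record that lacks the "//" terminator
    $out .= ">$gene\n$seq\n" if $seq ne '' && $wanted->{$gene};
    return $out;
}

# Command-line use (assumed file names):
#   perl gpff_by_gene.pl genes.txt file.gpff > selected.fasta
if (@ARGV == 2) {
    my ($list_file, $gpff_file) = @ARGV;
    open my $list, '<', $list_file or die "Cannot open $list_file: $!";
    my %wanted;
    while (my $name = <$list>) {
        $name =~ s/^\s+|\s+$//g;                # trim whitespace and newline
        $wanted{$name} = 1 if length $name;
    }
    close $list;
    open my $gpff, '<', $gpff_file or die "Cannot open $gpff_file: $!";
    print extract_genes($gpff, \%wanted);
    close $gpff;
}
```

Unlike the script above, this version writes each sequence on a single line. If a gene has several protein records in the file, each one is printed under the same gene header, so you may want to add the accession to the header in that case.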