Replace the NP* accession in the FASTA header with the gene name using a GFF file
2.2 years ago
najibveto ▴ 120

Hello, I would like to know whether it is possible to replace the NP* accession in the headers of a FASTA file with the gene name, using a GFF file. The protein FASTA file is as follows:

>NP_037582.1 NADH dehydrogenase subunit 1 (mitochondrion) [Paralichthys olivaceus] 
MISTLITHIINPLALIVPVLLAVAFLTLLERKVLGYMQLRKGPNIVGPYGLLQPIADGIKLFIKEPVRPSTASPVLFLLA PMLALTLALTLWAPMPLPYSTVDLNLGILFVLALSSLAVYSILGSGWASNSKYALVGALRAVAQTISYEVSLGLILLNII ILTGGFTLQTFNTAQEAVWLVLPAWPLAAMWYISTLAETNRAPFDLTEGESELVSGFNVEYAGGPFALFFLAEYSNILLM NTLSAVLFLGASHIPTIATLTAINLMTKAALLSVVFLWVRASYPRFRYDQLMHLIWKNFLPLTLALIIWHLALPTALAGL PPQL

>NP_037583.1 NADH dehydrogenase subunit 2 (mitochondrion) [Paralichthys olivaceus] 
MNPFILTTLLLGLGLGTTITFASSHWLLAWMGLEINTLAIIPLMAQHHHPRAVEATTKYFLTQATAAATLLFASMTNAWL TGQWDIQQMTHPLATTMIIIALALKIGLAPMHSWLPEVLQGLDLTTGLILSTWQKLAPFALLMQIQLDNPTPLIILGLTS TLVGGWGGLNQTQLRKILAYSSIAHLGWMMLILQFSPLLTLLALITYLIMTSSVFLIFKMNKATTINALAISWTKTPILT ALIPLVLLSLGGLPPLTGFMPKWFILQELTKQDLATLATLAALTALLSLYFYLRLSYAMTLTMAPNNLTGTTPWRFSSPQ LTLPLAISSTTATLLLPLAPATLALLTT

>NP_037584.1 cytochrome c oxidase subunit I (mitochondrion) [Paralichthys olivaceus] 
MAITRWFFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGALLGDDQIYNVIVTAHAFVMIFFMVMPIMIGGFGN WLIPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSGVEAGAGTGWTVYPPLASNLAHAGASVDLTIFSLHLAGISSILG AINFITTIINMKPTTVTMYQIPLFVWAVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGH PEVYILILPGFGMISHIVAYYSGKKEPFGYMGMVWAMMAIGLLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKV FSWLATLHGGNIKWETPLLWALGFIFLFTVGGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIVAAFVHWFPLF TGYTLHSTWTKVHFGVMFIGVNLTFFPQHFLGLAGMPRRYSDYPDAYALWNTVSSIGSLMSLVAVIMFLFIIWEAFSAKR EVLSVLMTATNVEWLHGCPPPYHTFEEPAFVRAPLN

>NP_037585.1 cytochrome c oxidase subunit II (mitochondrion) [Paralichthys olivaceus] 
MAHPSQLGFQDAASPLMEELLHFHDHALMIVILISTMVLYIIVAMVTAKLTDKLVLDSQEIEIIWTVLPAIILILIALPS LRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYEDLGFDSYMTPTQDLTPGQFRLLEADHRMVTPVESPIRVLISAEDVL HSWAIPALGVKVDAVPGRLNQTTFIISRPGVFFGQCSEICGANHSFMPIVVEAVPLQHFENWSSLMIEEA

>NP_037586.1 ATP synthase F0 subunit 8 (mitochondrion) [Paralichthys olivaceus] 
MPQLNPAPWFMILVFSWMVFLTIIPPKVLAHTFPNEPTPQSTQKPKTESWNWPWY

>NP_037587.1 ATP synthase F0 subunit 6 (mitochondrion) [Paralichthys olivaceus] 
MMLSFFDQFMSPVYLGIPLIALAIILPWALFPTPSSRWMNNRLLTLQGWFINRFTSQLLLPLNLGGHKWATLFASLMIFL LSINMLGLLPYTFTPTTQLSLNMGLAVPLWLATVIIGMRNQPTHALGHLLPEGTPTALIPVLIIIETISLFIRPLALGVR LTANLTAGHLLIQLIATAAFVLLPIMPMIAISTATLLFLLTLLEVAVAMIQAYVFVLLLSLYLQENV

>NP_037588.1 cytochrome c oxidase subunit III (mitochondrion) [Paralichthys olivaceus] 
MAHQAHPYHMVDPSPWPLTGAIAALLMTSGLAIWFHFHSTTLMTLGTILLILTIFQWWRDVVREATFQGHHTPPVQKGLR YGMILFITSEVLFFLGFFWAFYHASLAPTPELGGFWPPAGITPLDPFEVPLLNTAVLLASGVTVTWAHHSIMEGKRKQAI HSLFLTILLGGYFTFLQALEYHEAPFTIADGVYGATFFVATGFHGLHVLIGSTFLAVCLLRQILHHFTANHHFGFEAAAW YWHFVDVVWLFLYISIYWWGS

>NP_037589.1 NADH dehydrogenase subunit 3 (mitochondrion) [Paralichthys olivaceus] 
MSLLMTIITITALLSTILAIVSFWLPQISPDHEKLSPYECGFDPMGSARLPFSLRFFLIAILFLLFDLEIALLLPLPWGD QLPTPLLTFTWATAVLFLLTLGLIYEWIQGGLEWAE

>NP_037590.1 NADH dehydrogenase subunit 4L (mitochondrion) [Paralichthys olivaceus] 
MTPTHFAFSSAFLLGLTGLAFHRFHLLSALLCLEGMMLSLFIALSLWTLQLDSTNFSASPMLLLAFSACEASAGLALLVA TARTHGTDRLQSLNLLQC

>NP_037591.1 NADH dehydrogenase subunit 4 (mitochondrion) [Paralichthys olivaceus] 
MLKILIPTLMLIPTAWLVKPNWLWPTTLTHSFCISLASLSWLKNLSETGWSSLNLCMATDALSTPLLVLTCWLLPLMILA SQNHTASEPINRQRMYITLLTSLQFFLILAFGATEIIMFYVMFEATLIPTLIIITRWGNQTERLNAGTYFLFYTLAGSLP LLVALLLLQNSAGTLSLLTLHYTDPTHMTSYGDKLWWAGCLLAFLVKMPLYGVHLWLPKAHVEAPIAGSMILAAVLLKLG GYGMIRMMTMLEPLTKELSYPFIIFALWGVVMTGSICLRQTDLKSLIAYSSVSHMGLVAGGVLIQSPWGLTGSLILMIAH GLTSSALFCLANTNYERTHSRTMVLARGLQMALPLMATWWFIASLANLALPPLPNLMGELMIIISLFNWSWWTLALTGTG TLITAGYSLYMFLMTQRGPLPTHILALEPSHTREHLLIALHLLPLILLVLKPELIWGWTA

The replacement should be based on this GFF file:

> ##gff-version 3
> #!gff-spec-version 1.21
> #!processor NCBI annotwriter
> #!genome-build Flounder_ref_guided_V1.0
> #!genome-build-accession NCBI_Assembly:GCF_001970005.1
> #!annotation-source NCBI Paralichthys olivaceus Annotation Release 100
> ##sequence-region NW_017859641.1 1 23627241
> ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8255
> NW_017859641.1 RefSeq region 1 23627241 . + . ID=NW_017859641.1:1..23627241;Dbxref=taxon:8255;Name=Unknown;breed=gynogenesis;chromosome=Unknown;dev-stage=adult;gbkey=Src;genome=genomic;mol_type=genomic DNA;sex=female;tissue-type=blood
> NW_017859641.1 Gnomon gene 39370 44846 . + . ID=gene-LOC109628151;Dbxref=GeneID:109628151;Name=LOC109628151;gbkey=Gene;gene=LOC109628151;gene_biotype=protein_coding
> NW_017859641.1 Gnomon mRNA 39370 44846 . + . ID=rna-XM_020085119.1;Parent=gene-LOC109628151;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;Name=XM_020085119.1;gbkey=mRNA;gene=LOC109628151;model_evidence=Supporting evidence includes similarity to: 1 EST%2C 17 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 39370 39492 . + . ID=exon-XM_020085119.1-1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 40402 40612 . + . ID=exon-XM_020085119.1-2;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 40851 41003 . + . ID=exon-XM_020085119.1-3;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 41172 41282 . + . ID=exon-XM_020085119.1-4;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 41412 41540 . + . ID=exon-XM_020085119.1-5;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 41692 41811 . + . ID=exon-XM_020085119.1-6;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 41922 42038 . + . ID=exon-XM_020085119.1-7;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 42250 42377 . + . ID=exon-XM_020085119.1-8;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 42511 42637 . + . ID=exon-XM_020085119.1-9;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 42717 42884 . + . ID=exon-XM_020085119.1-10;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 43227 43379 . + . ID=exon-XM_020085119.1-11;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon exon 43829 44846 . + . ID=exon-XM_020085119.1-12;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XM_020085119.1;gbkey=mRNA;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;transcript_id=XM_020085119.1
> NW_017859641.1 Gnomon CDS 39467 39492 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 40402 40612 . + 1 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 40851 41003 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 41172 41282 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 41412 41540 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 41692 41811 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 41922 42038 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 42250 42377 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 42511 42637 . + 1 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 42717 42884 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 43227 43379 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon CDS 43829 43927 . + 0 ID=cds-XP_019940678.1;Parent=rna-XM_020085119.1;Dbxref=GeneID:109628151,Genbank:XP_019940678.1;Name=XP_019940678.1;gbkey=CDS;gene=LOC109628151;product=WD40 repeat-containing protein SMU1;protein_id=XP_019940678.1
> NW_017859641.1 Gnomon gene 45747 54972 . + . ID=gene-LOC109633196;Dbxref=GeneID:109633196;Name=LOC109633196;gbkey=Gene;gene=LOC109633196;gene_biotype=protein_coding

How can I change the protein ID into the gene name? Thank you for your help.

2.2 years ago

Not possible, I'm afraid, at least not with those two files as posted. The main reason is that those NP* accession numbers do not appear in the GFF file.
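A quick way to check this on the full files is to look for each FASTA accession in the GFF text. The following is only a minimal sketch: the function name `accessions_absent_from_gff` is made up for illustration, and it uses a plain substring test, so an accession that is a prefix of a longer one would count as present.

```python
def accessions_absent_from_gff(fasta_lines, gff_text):
    """Return FASTA accessions (first header token) not found in the GFF text.

    Hypothetical helper for illustration; uses a simple substring search.
    """
    accessions = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                accessions.append(tokens[0])
    return [acc for acc in accessions if acc not in gff_text]
```

An empty result would mean that every accession in the FASTA occurs somewhere in the GFF.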

You might still be able to achieve this if you extract the gene/protein sequences based on the GFF file and compare those sequences to the ones in your input file.
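The comparison step could look roughly like the sketch below. It assumes you have already produced a second protein FASTA from the GFF in which the first header token is the gene name (how you extract it is up to you), and it only matches sequences that are exactly identical. The function names `read_fasta` and `rename_by_sequence` are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; whitespace inside sequences is dropped."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)


def rename_by_sequence(query_lines, reference_lines):
    """Swap the accession in each query header for the gene name of the
    reference record with an identical sequence (assumption: the first
    token of every reference header is the gene name)."""
    gene_by_seq = {}
    for header, seq in read_fasta(reference_lines):
        tokens = header.split()
        if tokens:
            # Ignore a trailing stop symbol and letter case when comparing.
            gene_by_seq.setdefault(seq.rstrip("*").upper(), tokens[0])
    renamed = []
    for header, seq in read_fasta(query_lines):
        gene = gene_by_seq.get(seq.rstrip("*").upper())
        if gene is not None:
            parts = header.split(None, 1)
            rest = f" {parts[1]}" if len(parts) > 1 else ""
            header = f"{gene}{rest}"
        renamed.append((header, seq))
    return renamed
```

Records without an identical match keep their original header, so you can see which ones need a real alignment instead.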


Thank you for the reply. The full GFF does contain NP accession numbers; I only posted a small part of the FASTA file and the GFF file. Here are the links to both the GFF and the faa: olive flounder protein fasta

the gff file for olive flounder
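If the full GFF lists the NP* accessions in the `protein_id` attribute of its CDS rows, the way the XP_ rows in the sample above do, the renaming could be sketched as follows. This assumes a tab-separated GFF3 with `gene` and `protein_id` attributes on CDS rows; the function names `protein_to_gene` and `rename_headers` are made up for illustration.

```python
def protein_to_gene(gff_lines):
    """Map protein_id -> gene using the attributes of CDS rows."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "protein_id" in attrs and "gene" in attrs:
            mapping.setdefault(attrs["protein_id"], attrs["gene"])
    return mapping


def rename_headers(fasta_lines, mapping):
    """Yield FASTA lines, replacing the accession in each header with its
    gene name when the accession is present in the mapping."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            if parts and parts[0] in mapping:
                rest = f" {parts[1]}" if len(parts) > 1 else ""
                line = f">{mapping[parts[0]]}{rest}"
        yield line
```

Headers whose accession is not in the mapping are left unchanged. Note that several proteins can belong to the same gene, so the renamed headers are not guaranteed to be unique.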
