How to create or rename FASTA headers with a Python script?
7.7 years ago
nut_B ▴ 10

Hello everyone,

I would like to put my own gene names in the headers of a FASTA file. Could anyone help me write a Python script for this, please?

Original names from the gene prediction tool:

unitig0_FGENESH: 7 7 exon (s) 66831 - 67787 71 aa, chain - MCIECVRVLHGFGLNAYIGGARETKRGLGCATVKPATSFFNKGKKEKGLREVFSERYYYYTPWLATPYIKL unitig0_FGENESH: 8 4 exon (s) 70846 - 72746 472 aa, chain + MHESDALDRGTDIRLSPSSTTNICDSKLPLDNESLSQASAAGQGLSEKERELFENAQWQIRSHYLTGSGVDEPGVVKLDDGAVLPLMKKRDIRKDRVSGSKEAKPIPVFPEDRVSEIEIHPDHHDLHKMNGKSNIFVLKTFRHPRDPEIAEEDFKAELLANRQLPRHERIVPLLAAFEFENEFHLIFPFAHQGDLESLWKRTKMLPKHPLPGWYSPRWLLRECLGIAEALVETHSPTLPNNGAGSTVWVPQLHADIKARNILCFQSNDQAPPLLKLADFGYSQRAGEDGALNINGGLPHTKTYRPPERDIENFVGLNYDVWCLGCLYLDFITWAIIGYKGIEDFNESRMEEKNDKYVSKARGNDAEDIFFKKLARLPRWYDASALRFHSQRTEKLYKNTAFAQRSFTFSRGVIKISCDIKISVIQVNGQCQPEFRKLLKIIENDMLVVERQKRASSSKIKSLLQEVIRPQGL unitig1_FGENESH: 201 6 exon (s) 834292 - 836282 511 aa, chain + MYSLITKTLRNEEPSEVELEVVATLVELLTIDLYNLRLSQIGHPMYANFEGVTFRGLSITLEELEEYKAILARPELAKRNFSVPLGLLSSSTDEKIMEEFSKSTHSDGKRPIQLHLTIHIHGLDPALLREYRRLYPDSVVSSICAMPVGHVSPHGEKEILLRGPFFHMISMSSGELNGRPCEKLVVVMMNANRDHGMEHASDQGAKERQRQCFLRAVSASKYEVCAAIAEKHAPWDANAYRTLQNNALQQLQDIDGIYAITDGHLAEHRAKNVATWLGGALRKSYPRYYAKKRVSWQTMIRDENWKEADKILRAEYEWKKRDWYNVGQLTDDGGEFVDDNLTLLHILATKPPPPSAETDRFECWQRLIDGACQPEVWTRRCGNEGLMERLQPRIVHALDPAALGDLELRLHKIMVEKFGQFLSEHIFHMPQLSVLTEMDIPELWIPIPLSYGGVFVTLSKPERIWGLDVVVYEVVVGQTWPKKTDCSLGIYDLARSPDLVYWSSIGDQSSA unitig1_FGENESH: 202 5 exon (s) 836941 - 841297 1319 aa, chain + 
MASTPFPGSQWPRETPSEDISDQRNNSFYRDRDSSRQTTWDTTDEQSSTEYKNETTSPDSVYHNGYKMWNSIWLQTQVLAGLLVLFVALFLVTILLYHFSEKNHGLSAEDATRQYGWRYGPTAFLTIILSLWVQIDFSNKILTPWQEMRQGHTTADRSVLLEYISPLMITSLWRALKNRHWAVTASGLGILLIQLATVFSTGLFVLQPTALEQDDIPVVVNSVFDGSDFHLTNTSSTIGTGPAILYYGTRVHGLDPLPGVDVSRGLVVPDFTPFTEKAMAGGTNYTAIVPGVESSLDCEYIPALTNATRTSLPWWSILSAFFVLNVTTPSCSISNIIVGQGPDHNIYHQPNATQAYQGYFGDYICDPNINYGFYELPDPSNTTLEHRIVMTMADLRFPPREPRGAGPAYIYIHNLTVAVCKAGYARADYEVTYREGIAGQTKSWTSNKLSTSSSEIPGFSSAQLGAAVHSSLDQAYLGTGGQDWVLSKQVPSFYQILSAMNNNVSIGHFMEPRNLIDSATEAFNGIATQLIYKHMMKPSNTTISGSLLYQEDRLWVRALSVGFMGAAFLLLAGLVIVLLIFRPWNVVPSDPGSIGATALILTESSALRDLLMGLGAARSSQIRHRLSSYNFRSVVSPGPRKTFTVVAIEHGQPTVHQDMLGCSPPQSEHWWVPSAVKWWFQFIAVLLPLVIIAVLEILQRLSDQNNGFVDLGPDGFASTHGLSTYVPAVVAFVVASMFASLQLAVCILAPWLALHKGSAPASRSLFLNLTNRLAPHRMFLAFKNGNLGEVLIMMATFLAAWLPILVSGLYVTIPGTTPQSVTLKQSDVFDFKLNNLFYDDHLAGTVAGLIAFDDLPYRQWTYGDLVFNQLETIDGPGNTAAGNEVPFTARLKATRPSLDCTVVPAHSTMASWDKKQTDYRSIPEDKVALNLTSSIPWMCERRNGNITSVPWFQGFALPKDGRPIYFGHASVLSWGGKVFGNRAIITDVNRPGATSFTPESVANWVGGYGCPSFAVTLGRGSAVSKASGNSTTYDFDIDVTSILCSQRMEVVDTDVTLTLPSLGVISHDTPPVPDESTARYLVNTLRNHTSQIFEFPLNNLLLTLAYGTGEIIIPTSDGEENQLDPFVQFLATVNVSSPIDSLVGRNNSQNLIDATNRLYKTYMPQAIDRNMRTKNLETEVATPDSANAKVEPIQFTTRPEFPGRLRLKQEAAPKIALQAILGFMVLCAILSRVLLKGIDKLVPHNPCSIAGRAALFADGEVSTRKLVPYGAEWRTESELSSAGVYAGWLFSLGWWESWGVYKYGVDIGWIDRGKAENQM unitig153_FGENESH: 674 5 exon (s) 2967149 - 2968758 393 aa, chain + MGVGAVKYTQVACTVCIVTGMLIGICGAKCAGKKTVARYLVEHHGFKSLHIENQAPDPIQNGISPSEASGTEASPGSHANTVEEENDANTRDLVIRPKNGAMRSLHIFESEGALLDFVTKHWRSRWVTTDIHSEAVLDALNRRPFFILISVDAPVTVRWRRHQARQKQVSRPRKGSTSFEDFVAESDAHLYAAHGGVLPLMSRAAIRLLNTSDSLAHLYATLGKLDLTNGDRLRPSWDSYFMALASLAARRCNCMKRAVGCVLVDSKRRVISTGYNGTPRNLTNCMEGGCPRCNSGDATSGVSLATCLCLHAEENALLEAGRERIRDGSVLYCNTCPCLTCSIKIVQVGISEVVYNQGYSMDGETARVFLSAGVKLRQFSPPADGLIHLEKTE unitig153_FGENESH: 675 3 exon (s) 2969361 - 2969732 65 aa, chain - MPAINTAVVARDTVHQLARRENWAQQEAGVIVVFAIVFVVGVGLISLWISKLLKKRKAKKAALGA unitig3_FGENESH: 655 3 exon (s) 2965882 - 2966129 12 aa, chain + MCSVPLLLTLLQ 
unitig3_FGENESH: 656 4 exon (s) 2973820 - 2977069 384 aa, chain + MQMDRDFEELKEGVKVVGAVILEVQHTVEECKGKISESGEKLEMVRVGLNDFATDMNTLVNSVEADGGSQQGPRPLLQGSLQGRIPQLENENTFLRKGVDTLQATIQNMQQKHAYELAARTSHLQKRDDRCHHQLDRQAELITNVIDTIYSVFADYKEELRLLTHVNAREHNNRAPATDEIRPLSHAAPPHNHFLHHLGGETQGNPILRPEEQPGHDSDTGSSVDFDHDFRQCLREHITEVLDWYQSTVVAAEDVKSLAQRVDHFIYIMCKYHAEQGKIPTIQDVQLGVRILPLPREILLTREGSSTIPDPGHMPIHEEETRRPASAENLAPRSASSSLTFVDEDKVGIEELDITESFVEEGTDAVSRGSDFSCSCEGLPSRFT

I would like to rename the headers like this:

XY000_2FG00001_00007 MCIECVRVLHGFGLNAYIGGARETKRGLGCATVKPATSFFNKGKKEKGLREVFSERYYYYTPWLATPYIKL XY000_1FG00002_00008 MHESDALDRGTDIRLSPSSTTNICDSKLPLDNESLSQASAAGQGLSEKERELFENAQWQIRSHYLTGSGVDEPGVVKLDDGAVLPLMKKRDIRKDRVSGSKEAKPIPVFPEDRVSEIEIHPDHHDLHKMNGKSNIFVLKTFRHPRDPEIAEEDFKAELLANRQLPRHERIVPLLAAFEFENEFHLIFPFAHQGDLESLWKRTKMLPKHPLPGWYSPRWLLRECLGIAEALVETHSPTLPNNGAGSTVWVPQLHADIKARNILCFQSNDQAPPLLKLADFGYSQRAGEDGALNINGGLPHTKTYRPPERDIENFVGLNYDVWCLGCLYLDFITWAIIGYKGIEDFNESRMEEKNDKYVSKARGNDAEDIFFKKLARLPRWYDASALRFHSQRTEKLYKNTAFAQRSFTFSRGVIKISCDIKISVIQVNGQCQPEFRKLLKIIENDMLVVERQKRASSSKIKSLLQEVIRPQGL XY001_1FG00003_00201 MYSLITKTLRNEEPSEVELEVVATLVELLTIDLYNLRLSQIGHPMYANFEGVTFRGLSITLEELEEYKAILARPELAKRNFSVPLGLLSSSTDEKIMEEFSKSTHSDGKRPIQLHLTIHIHGLDPALLREYRRLYPDSVVSSICAMPVGHVSPHGEKEILLRGPFFHMISMSSGELNGRPCEKLVVVMMNANRDHGMEHASDQGAKERQRQCFLRAVSASKYEVCAAIAEKHAPWDANAYRTLQNNALQQLQDIDGIYAITDGHLAEHRAKNVATWLGGALRKSYPRYYAKKRVSWQTMIRDENWKEADKILRAEYEWKKRDWYNVGQLTDDGGEFVDDNLTLLHILATKPPPPSAETDRFECWQRLIDGACQPEVWTRRCGNEGLMERLQPRIVHALDPAALGDLELRLHKIMVEKFGQFLSEHIFHMPQLSVLTEMDIPELWIPIPLSYGGVFVTLSKPERIWGLDVVVYEVVVGQTWPKKTDCSLGIYDLARSPDLVYWSSIGDQSSA XY001_1FG00004_00202 
MASTPFPGSQWPRETPSEDISDQRNNSFYRDRDSSRQTTWDTTDEQSSTEYKNETTSPDSVYHNGYKMWNSIWLQTQVLAGLLVLFVALFLVTILLYHFSEKNHGLSAEDATRQYGWRYGPTAFLTIILSLWVQIDFSNKILTPWQEMRQGHTTADRSVLLEYISPLMITSLWRALKNRHWAVTASGLGILLIQLATVFSTGLFVLQPTALEQDDIPVVVNSVFDGSDFHLTNTSSTIGTGPAILYYGTRVHGLDPLPGVDVSRGLVVPDFTPFTEKAMAGGTNYTAIVPGVESSLDCEYIPALTNATRTSLPWWSILSAFFVLNVTTPSCSISNIIVGQGPDHNIYHQPNATQAYQGYFGDYICDPNINYGFYELPDPSNTTLEHRIVMTMADLRFPPREPRGAGPAYIYIHNLTVAVCKAGYARADYEVTYREGIAGQTKSWTSNKLSTSSSEIPGFSSAQLGAAVHSSLDQAYLGTGGQDWVLSKQVPSFYQILSAMNNNVSIGHFMEPRNLIDSATEAFNGIATQLIYKHMMKPSNTTISGSLLYQEDRLWVRALSVGFMGAAFLLLAGLVIVLLIFRPWNVVPSDPGSIGATALILTESSALRDLLMGLGAARSSQIRHRLSSYNFRSVVSPGPRKTFTVVAIEHGQPTVHQDMLGCSPPQSEHWWVPSAVKWWFQFIAVLLPLVIIAVLEILQRLSDQNNGFVDLGPDGFASTHGLSTYVPAVVAFVVASMFASLQLAVCILAPWLALHKGSAPASRSLFLNLTNRLAPHRMFLAFKNGNLGEVLIMMATFLAAWLPILVSGLYVTIPGTTPQSVTLKQSDVFDFKLNNLFYDDHLAGTVAGLIAFDDLPYRQWTYGDLVFNQLETIDGPGNTAAGNEVPFTARLKATRPSLDCTVVPAHSTMASWDKKQTDYRSIPEDKVALNLTSSIPWMCERRNGNITSVPWFQGFALPKDGRPIYFGHASVLSWGGKVFGNRAIITDVNRPGATSFTPESVANWVGGYGCPSFAVTLGRGSAVSKASGNSTTYDFDIDVTSILCSQRMEVVDTDVTLTLPSLGVISHDTPPVPDESTARYLVNTLRNHTSQIFEFPLNNLLLTLAYGTGEIIIPTSDGEENQLDPFVQFLATVNVSSPIDSLVGRNNSQNLIDATNRLYKTYMPQAIDRNMRTKNLETEVATPDSANAKVEPIQFTTRPEFPGRLRLKQEAAPKIALQAILGFMVLCAILSRVLLKGIDKLVPHNPCSIAGRAALFADGEVSTRKLVPYGAEWRTESELSSAGVYAGWLFSLGWWESWGVYKYGVDIGWIDRGKAENQM XY153_1FG00005_00674 MGVGAVKYTQVACTVCIVTGMLIGICGAKCAGKKTVARYLVEHHGFKSLHIENQAPDPIQNGISPSEASGTEASPGSHANTVEEENDANTRDLVIRPKNGAMRSLHIFESEGALLDFVTKHWRSRWVTTDIHSEAVLDALNRRPFFILISVDAPVTVRWRRHQARQKQVSRPRKGSTSFEDFVAESDAHLYAAHGGVLPLMSRAAIRLLNTSDSLAHLYATLGKLDLTNGDRLRPSWDSYFMALASLAARRCNCMKRAVGCVLVDSKRRVISTGYNGTPRNLTNCMEGGCPRCNSGDATSGVSLATCLCLHAEENALLEAGRERIRDGSVLYCNTCPCLTCSIKIVQVGISEVVYNQGYSMDGETARVFLSAGVKLRQFSPPADGLIHLEKTE XY153_2FG00006_00675 MPAINTAVVARDTVHQLARRENWAQQEAGVIVVFAIVFVVGVGLISLWISKLLKKRKAKKAALGA XY003_1FG00007_00655 MCSVPLLLTLLQ XY003_1FG00008_00656 
MQMDRDFEELKEGVKVVGAVILEVQHTVEECKGKISESGEKLEMVRVGLNDFATDMNTLVNSVEADGGSQQGPRPLLQGSLQGRIPQLENENTFLRKGVDTLQATIQNMQQKHAYELAARTSHLQKRDDRCHHQLDRQAELITNVIDTIYSVFADYKEELRLLTHVNAREHNNRAPATDEIRPLSHAAPPHNHFLHHLGGETQGNPILRPEEQPGHDSDTGSSVDFDHDFRQCLREHITEVLDWYQSTVVAAEDVKSLAQRVDHFIYIMCKYHAEQGKIPTIQDVQLGVRILPLPREILLTREGSSTIPDPGHMPIHEEETRRPASAENLAPRSASSSLTFVDEDKVGIEELDITESFVEEGTDAVSRGSDFSCSCEGLPSRFT

The parts of the gene name XY000_2FG00001_00007 are:

XY = my species name
000 = contig number (unitig1, unitig2, unitig153, ...)
1/2 = strand (1 if the strand is +, 2 if the strand is -)
FG = FGENESH gene prediction tool
00001 = running count over all genes (I have 12112 genes) [00000, 00001, 00002, 00003, ..., 12112]
00007 = gene number assigned by FGENESH during prediction [7, 8, 201, 202, 674, 675, 655, 656, ...]
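
To make the scheme concrete, here is a minimal sketch of how the fields could be assembled into one ID with zero-padded formatting. The function name `build_gene_id` and its parameters are hypothetical, chosen only for illustration. The padding widths (3 digits for the contig, 5 for both counters) are inferred from the example IDs above.

```python
def build_gene_id(contig_number, strand, running_count, gene_number,
                  species="XY", tool="FG"):
    """Assemble an ID such as XY000_2FG00001_00007 from its parts.

    All names here are illustrative; the widths follow the example IDs.
    """
    # Strand code as described in the scheme: "+" becomes 1, "-" becomes 2
    strand_code = {"+": 1, "-": 2}[strand]
    return (f"{species}{contig_number:03d}_"
            f"{strand_code}{tool}{running_count:05d}_{gene_number:05d}")


# First record of the example: unitig0, gene 7, chain -, first gene overall
print(build_gene_id(0, "-", 1, 7))
```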

If anyone knows how to write a Python script for this problem, please help me or give me a suggestion.

Thank you in advance

python rename fasta header script • 3.0k views
7.7 years ago
guillaume.rbt ★ 1.0k

Hi,

I would loop over the lines of the file and, on every other line (the header lines), split the line to get the info you need.

For example:

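with open("file_path") as f:
    # Count lines from 1 so that odd lines are ids and even lines are sequences
    for line_number, line in enumerate(f, start=1):
        line = line.strip()  # drop the newline so the last character is the strand
        if line_number % 2 != 0:  # odd line (id)
            contig = line.lstrip('>').split('_')[0]  # e.g. "unitig0"
            strand = line[-1]  # "+" or "-"
            # ..... get all fields you need and print the id in the good format
        else:
            print(line)  # even line (protein sequence)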
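
A fuller sketch of the same idea follows. It recognises header lines with a regular expression instead of relying on line parity, so it also works when a sequence spans several lines. The names `HEADER_RE` and `rename_records` are hypothetical, and the pattern is an assumption based only on the header layout shown in the question (`unitigN_FGENESH: <gene no.> ... chain +/-`). Adjust it if your headers differ.

```python
import re

# Assumed header layout, taken from the question's examples:
#   unitig0_FGENESH: 7 7 exon (s) 66831 - 67787 71 aa, chain -
HEADER_RE = re.compile(
    r"^>?\s*unitig(\d+)_FGENESH:\s*(\d+)\s.*chain\s*([+-])\s*$"
)


def rename_records(lines, species="XY"):
    """Yield the input lines with FGENESH headers replaced by new IDs."""
    count = 0  # running gene count over the whole file
    for line in lines:
        line = line.rstrip("\n")
        match = HEADER_RE.match(line)
        if match:
            count += 1
            contig = int(match.group(1))
            gene = int(match.group(2))
            strand_code = 1 if match.group(3) == "+" else 2
            # Keep a leading ">" only if the input header had one
            prefix = ">" if line.startswith(">") else ""
            yield f"{prefix}{species}{contig:03d}_{strand_code}FG{count:05d}_{gene:05d}"
        elif line.strip():
            yield line  # sequence line, passed through unchanged
```

It could be used as `with open("file_path") as f: print("\n".join(rename_records(f)))`, where `file_path` is the placeholder from the snippet above.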

Thank you very much, guillaume.rbt. I will try this script.

