Hi community,
I know this is a common type of question, since there are lots of posts about FASTA headers, but I don't have a good background in informatics or in sed's syntax, so I don't really know how to write the right command. I have a FASTA file that looks like this:
>CDS CDS FIG00947432: hypothetical protein 2322:2624 forward MW:11399
MSSKYPVAYAVGQKIKSLRKSQGYTVFQLAKEIDISEQQLFRYERGVNRIDIDCLVRVLE
VLGVNIGSFFEEVTGGMAQEIERNEQHIPSHFDSKALSIF
>CDS CDS Thiosulfate reductase cytochrome B subunit 3993:4772 reverse MW:29554
MTSIWGAELHYTPDYWPVWLMAAGLLIVAMIAVLVIHGLLRYALAPKHTGHYEEERVYLY
SKAIRFWHWGNALLFILLLLSGFLGHFSIGNVTSMVLLHKICGFVLIAFWIGFILINLTT
SNGVHYKVRFSGLIGRCIKQARFYLYGIMKGEPHPFAATETDKFNPLQQLAYLGVMFGLV
PLLLVTGLLCLYPEVLGYGYWMLKAHLVLGIVALMFICAHFYLCTLGDTFTQTFRSMVDG
HHRHQKHDNHRSANEKVEH
>CDS CDS Thiosulfate reductase electron transport protein phsB 4769:5344 reverse MW:21362
MNNNKQFVMLHDEKRCIGCQACTVACKVINDIPEGFSRLQVQIQGPHNDEAGNPHYQFFR
VSCQHCEDAPCVSVCPTGASFIDENGIVQVKKELCIGCDYCVGACPYHVRYINPMTHIAD
KCNFCSDTRLTEGELPACVSVCPTDALAFGRIDSPEIQAWIKQKSVYQYQLDNVGKPSLF
RRKEIHQGDKA
>CDS CDS Thiosulfate reductase precursor (EC 1.-.-.-) 5359:7638 reverse MW:83512
MSISRRSFIKGMGVGCVGCTVSSLPPGALAFNPVDSLKGQSTLTPSLCEMCSYRCPIEAQ
VVNNKTVFIQGNRNAEHQSSRVCARGGSGVSLVNDPNRIVKPMKHKGPRGAGEWEVISWE
QAYKEIAEKMNAIKQNYGAESISFSSKSGSLSSHLFHLAAAFGSPNTFTHASTCPAGKAI
AASVMMGGDLKMDLANSKYILSFGHNLYEGIEVAETHELMTAQERGAKLVSFDPRLSVVS
SKADEWFAIRPGGDLPVLMAMCHILIKEDLYDKEFVEKFTVGFPQLKDVLQETTPEWAQA
HSDVPAKDIVRIAREIAAKAPHALIMPGHRATFNKEEINMRRMIFTFNALLGNIEREGGL
YQKKAATKYNKLAGIAVAPELAKPSVKGMPEITAKRIDATAPQFKYINKGGGIVQSIIDS
TLEGVPYQTKAWIMSRHNPFQTVSCRPDLEKAAQKLDLIVSCDVYLSESAAYADYLLPEC
TYLERDEEVADVSGLNPAYALRQQVVEPIGDTKPSWLIWMELGKALGLEACFPWENMGVR
QLYQVNGSEELYKEMHKKGYISYGVPLLLREPSYVKAFVDQYPDAIKQVDSNNTMEKALS
FKSPSGLIEIYSEELESRLENYGIPRFHNFPLKEKDELYFIQGKVAVHTNGATQYVPLLA
ELMWENPVWLHPETAKNHGIKHGDEIILENSVGKEKARALITEGIRPDTVFVYMGSGAKA
GAKTAATTTGVHCGNLLPHEISPVSGTDVHTSGVRISRA
I want to rename all the FASTA headers so that each one contains a running number right after the first CDS. The output would look like this:
>CDS1 CDS FIG00947432: hypothetical protein 2322:2624 forward MW:11399
MSSKYPVAYAVGQKIKSLRKSQGYTVFQLAKEIDISEQQLFRYERGVNRIDIDCLVRVLE
VLGVNIGSFFEEVTGGMAQEIERNEQHIPSHFDSKALSIF
>CDS2 CDS Thiosulfate reductase cytochrome B subunit 3993:4772 reverse MW:29554
MTSIWGAELHYTPDYWPVWLMAAGLLIVAMIAVLVIHGLLRYALAPKHTGHYEEERVYLY
SKAIRFWHWGNALLFILLLLSGFLGHFSIGNVTSMVLLHKICGFVLIAFWIGFILINLTT
SNGVHYKVRFSGLIGRCIKQARFYLYGIMKGEPHPFAATETDKFNPLQQLAYLGVMFGLV
PLLLVTGLLCLYPEVLGYGYWMLKAHLVLGIVALMFICAHFYLCTLGDTFTQTFRSMVDG
HHRHQKHDNHRSANEKVEH
>CDS3 CDS Thiosulfate reductase electron transport protein phsB 4769:5344 reverse MW:21362
MNNNKQFVMLHDEKRCIGCQACTVACKVINDIPEGFSRLQVQIQGPHNDEAGNPHYQFFR
VSCQHCEDAPCVSVCPTGASFIDENGIVQVKKELCIGCDYCVGACPYHVRYINPMTHIAD
KCNFCSDTRLTEGELPACVSVCPTDALAFGRIDSPEIQAWIKQKSVYQYQLDNVGKPSLF
RRKEIHQGDKA
>CDS4 CDS Thiosulfate reductase precursor (EC 1.-.-.-) 5359:7638 reverse MW:83512
MSISRRSFIKGMGVGCVGCTVSSLPPGALAFNPVDSLKGQSTLTPSLCEMCSYRCPIEAQ
VVNNKTVFIQGNRNAEHQSSRVCARGGSGVSLVNDPNRIVKPMKHKGPRGAGEWEVISWE
QAYKEIAEKMNAIKQNYGAESISFSSKSGSLSSHLFHLAAAFGSPNTFTHASTCPAGKAI
AASVMMGGDLKMDLANSKYILSFGHNLYEGIEVAETHELMTAQERGAKLVSFDPRLSVVS
SKADEWFAIRPGGDLPVLMAMCHILIKEDLYDKEFVEKFTVGFPQLKDVLQETTPEWAQA
HSDVPAKDIVRIAREIAAKAPHALIMPGHRATFNKEEINMRRMIFTFNALLGNIEREGGL
YQKKAATKYNKLAGIAVAPELAKPSVKGMPEITAKRIDATAPQFKYINKGGGIVQSIIDS
TLEGVPYQTKAWIMSRHNPFQTVSCRPDLEKAAQKLDLIVSCDVYLSESAAYADYLLPEC
TYLERDEEVADVSGLNPAYALRQQVVEPIGDTKPSWLIWMELGKALGLEACFPWENMGVR
QLYQVNGSEELYKEMHKKGYISYGVPLLLREPSYVKAFVDQYPDAIKQVDSNNTMEKALS
FKSPSGLIEIYSEELESRLENYGIPRFHNFPLKEKDELYFIQGKVAVHTNGATQYVPLLA
ELMWENPVWLHPETAKNHGIKHGDEIILENSVGKEKARALITEGIRPDTVFVYMGSGAKA
GAKTAATTTGVHCGNLLPHEISPVSGTDVHTSGVRISRA
The reason is simple: I'll be submitting this FASTA file to software such as SurfG+, MEDpipe and inmembrane. Since the CDSs in this file have no unique identifier or numbering, the software's output wouldn't tell me which one is which. Thank you all in advance.
This is a commonly asked question here, and you should find multiple threads that help with it. Use Google to search Biostars externally; the internal Biostars search engine is not the best.
Here is one: How To Rename FASTA Headers
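For the specific renaming asked about above (a running number right after the first CDS), here is a minimal Python sketch rather than a sed command. It assumes every header to be numbered starts with ">CDS" followed by a space, as in the example; headers that look different are left untouched. The function names `number_headers` and `renumber_file`, and any file names you pass in, are made up for illustration.

```python
def number_headers(lines, prefix="CDS"):
    """Yield lines, appending a running number to the leading prefix of each header.

    Assumption: headers to be numbered start with '>' + prefix + ' '.
    Sequence lines and any other headers pass through unchanged.
    """
    marker = ">" + prefix
    count = 0
    for line in lines:
        if line.startswith(marker + " "):
            count += 1
            # Keep the rest of the header exactly as it was.
            line = marker + str(count) + line[len(marker):]
        yield line


def renumber_file(in_path, out_path, prefix="CDS"):
    """Read a FASTA file and write a copy with numbered headers (hypothetical helper)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(number_headers(src, prefix))


# Small demonstration on the first two records from the question (sequences shortened).
sample = [
    ">CDS CDS FIG00947432: hypothetical protein 2322:2624 forward MW:11399\n",
    "MSSKYPVAYAVGQKIKSLRKSQGYTVFQLAKEIDISEQQLFRYERGVNRIDIDCLVRVLE\n",
    ">CDS CDS Thiosulfate reductase cytochrome B subunit 3993:4772 reverse MW:29554\n",
    "MTSIWGAELHYTPDYWPVWLMAAGLLIVAMIAVLVIHGLLRYALAPKHTGHYEEERVYLY\n",
]
print("".join(number_headers(sample)), end="")
```

To run it on a real file, call something like `renumber_file("input.fasta", "numbered.fasta")`, replacing both placeholder file names with your own. The input is read line by line, so large files should not be a problem.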