Hi, I have a FASTA file of a genome, something like:
Strain-01.faa
>IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
I want to add an extra ID to each header, based on a list in a Genes_file.txt file.
Genes_file.txt
ID Gene Strain-01 Strain-02 Strain-03
ID_01 pphB IMEHDJCA_03186 DIBHEKPI_01648 LLMDBGDK_00598
ID_02 group_1001 IMEHDJCA_03187 DIBHEKPI_01635 LLMDBGDK_00611
ID_03 group_1002 IMEHDJCA_03189 DIBHEKPI_01628 LLMDBGDK_00616
For example, Strain-01.faa contains the ID IMEHDJCA_03186, which is listed in the Strain-01 column of Genes_file.txt. I want to add the matching value from the ID column (ID_01) to the header of that sequence. So ID_01 corresponds to IMEHDJCA_03186, ID_02 to IMEHDJCA_03187, and ID_03 to IMEHDJCA_03189, and the result would look like:
Strain-01_edited.faa
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
I just want to add the ID code from Genes_file.txt to the headers of each strain's FASTA file.
Any ideas on how to do this, in bash, R, or any other way?
Thanks so much
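One possible way to do the lookup described above, as a minimal Python sketch. It assumes Genes_file.txt is whitespace-separated (spaces or tabs) with the header row shown in the question, and that the locus tag is the first word after ">" in each FASTA header. The function names (load_id_map, add_ids, annotate_file) are made up for this example. Headers whose locus tag is not in the table are left unchanged, which is a choice of this sketch rather than something the question specifies.

```python
def load_id_map(table_lines, strain_column):
    """Build {locus_tag: ID} from the gene table for one strain column.

    Assumes whitespace-separated fields and a header row that contains
    an "ID" column and the given strain column (e.g. "Strain-01").
    """
    rows = (line.split() for line in table_lines if line.strip())
    header = next(rows)
    id_idx = header.index("ID")
    strain_idx = header.index(strain_column)
    return {fields[strain_idx]: fields[id_idx] for fields in rows}


def add_ids(fasta_lines, id_map):
    """Yield FASTA lines, prefixing each known header with its ID."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The locus tag is assumed to be the first word of the header.
            parts = line[1:].split(None, 1)
            locus_tag = parts[0] if parts else ""
            new_id = id_map.get(locus_tag)
            if new_id is not None:
                line = f">{new_id} {line[1:]}"
        yield line


def annotate_file(table_path, fasta_path, out_path, strain_column):
    """Read the table and the FASTA file, write the edited FASTA file."""
    with open(table_path) as table:
        id_map = load_id_map(table, strain_column)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in add_ids(fasta, id_map):
            out.write(line + "\n")
```

With the files from the question, the call would be annotate_file("Genes_file.txt", "Strain-01.faa", "Strain-01_edited.faa", "Strain-01"). For the other strains, change the file names and the column name.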
with seqkit: