Add text to FASTA headers based on a list
3.4 years ago
cabraham03 ▴ 30

Hi, I have a FASTA file for a genome, something like:

Strain-01.faa

>IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD

I want to add an extra ID to each header, based on a list in a Genes_file.txt file.

Genes_file.txt

ID      Gene        Strain-01       Strain-02       Strain-03
ID_01   pphB        IMEHDJCA_03186  DIBHEKPI_01648  LLMDBGDK_00598
ID_02   group_1001  IMEHDJCA_03187  DIBHEKPI_01635  LLMDBGDK_00611
ID_03   group_1002  IMEHDJCA_03189  DIBHEKPI_01628  LLMDBGDK_00616

For example, Strain-01.faa contains the header IMEHDJCA_03186, which is listed in the Strain-01 column of Genes_file.txt, so I want to add the matching value from the ID column (ID_01) to that sequence's header. That is, ID_01 corresponds to IMEHDJCA_03186, ID_02 to IMEHDJCA_03187, and ID_03 to IMEHDJCA_03189, and the result would look like:

Strain-01_edited.faa

>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD

I just want to add the ID code from Genes_file.txt to the headers in each strain's FASTA file.

Any idea how to do this in bash, R, or any other way?

Thanks so much

fasta header R bash • 852 views

With seqkit (here test.txt stands in for Genes_file.txt and test.fa for Strain-01.faa):

$ seqkit -w 0 --quiet replace -K -p '^(\w+)( .+)$' -r '{kv} ${1}${2}' -k <(awk -v OFS="\t" 'NR > 1 {print $3,$1}' test.txt) test.fa

>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVGVLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALLAFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADISEFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
3.4 years ago
sed -f <(awk 'NR > 1 {printf("s/^>%s/>%s %s/\n",$3,$1,$3);}' Genes_file.txt) Strain-01.faa
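
Since the question asks for this on each strain file, here is a minimal awk-only sketch of the same idea that looks up the strain's column by name instead of hard-coding `$3`. The function name `add_ids` is hypothetical, and it assumes that Genes_file.txt is whitespace-delimited with no spaces inside fields and that each column name (e.g. Strain-01) matches its FASTA file name (Strain-01.faa). Headers whose first word is not in the table are left unchanged, and the lookup is an exact match on the whole locus tag rather than a prefix match.

```shell
# Hypothetical helper: add_ids <table> <strain> <fasta>
# Prints the FASTA to stdout with the table's ID prepended to known headers.
add_ids() {
    awk -v strain="$2" '
        # Table, header row: find which column belongs to this strain.
        NR == FNR && FNR == 1 {
            for (i = 1; i <= NF; i++)
                if ($i == strain) col = i
            next
        }
        # Table, data rows: map locus tag -> ID (first column).
        NR == FNR {
            if (col) id[$col] = $1
            next
        }
        # FASTA headers: the first word, minus ">", is the locus tag.
        /^>/ {
            tag = substr($1, 2)
            if (tag in id) sub(/^>/, ">" id[tag] " ")
        }
        # Print every FASTA line, edited or not.
        { print }
    ' "$1" "$3"
}

# Possible usage, one output file per strain (assumes files named Strain-*.faa):
#   for f in Strain-*.faa; do
#       s=${f%.faa}
#       add_ids Genes_file.txt "$s" "$f" > "${s}_edited.faa"
#   done
```

Reading the table first and keeping it in an array means each FASTA file is scanned only once, which should matter little for a single genome but avoids generating one sed expression per gene.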