How to rename the fasta header in a multifasta file
3.5 years ago

Hi, I have a multifasta file like the example below:

>hCoV-19/Bangladesh/BCSIR-NILMRC-523/2021|EPI_ISL_1034736|2021-01-22
CCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGG
CGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTC
>hCoV-19/Bangladesh/BCSIR-NILMRC-515/2020|EPI_ISL_1034763|2020-12-24
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACG
AATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGT
>hCoV-19/Bangladesh/BCSIR-NILMRC-517/2020|EPI_ISL_1035809|2020-12-24
GGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTG
CTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCG
CTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGAT

Each sequence has an ID (for example, EPI_ISL_1034736) in its header. I want to keep only that ID in the header. The resulting file should look like this:

>EPI_ISL_1034736
CCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGG
CGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTC
>EPI_ISL_1034763
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACG
AATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGT
>EPI_ISL_1035809
GGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTG
CTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCG
CTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGAT

Can anyone help me achieve this? I can use the seqkit replace tool to rename headers with my own strings, but in this case I need to keep the sequence ID that is already in the header.


See if the solutions in this thread help: Fasta header trimming

$ sed -r '/^>/ s/.*\|(.*)\|.*/>\1/' test.fa
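Since the ID is always the second `|`-delimited field in these headers, the same result can probably be reached by splitting on `|` instead of matching a regular expression. The following is only a sketch under that assumption; `keep_id` is a hypothetical helper name, not part of any tool mentioned in this thread, and the sequences are shortened for illustration.

```shell
# Hypothetical helper: on header lines, print only the second '|'-delimited
# field (the EPI_ISL ID); pass sequence lines through unchanged.
# Reads the files given as arguments, or stdin if none are given.
keep_id() {
    awk -F'|' '/^>/ { print ">" $2; next } { print }' "$@"
}

# Small demonstration using two headers from the question.
printf '%s\n' \
    '>hCoV-19/Bangladesh/BCSIR-NILMRC-523/2021|EPI_ISL_1034736|2021-01-22' \
    'CCAGGTAACAAACC' \
    '>hCoV-19/Bangladesh/BCSIR-NILMRC-515/2020|EPI_ISL_1034763|2020-12-24' \
    'ATTAAAGGTTTATA' | keep_id
```

On a real file this would be called as `keep_id test.fa > renamed.fa` (the output name is illustrative). Note that a header without any `|` would come out as a bare `>`, so this only suits files where every header follows the three-field layout shown in the question.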
3.5 years ago

You can use a regular expression to capture the ID:

seqkit seq -i --id-regexp "\|([^\|]+)\|"

Options/Flags used:

     -i, --only-id                   print ID instead of full head

     --id-regexp string              regular expression for parsing ID (default "^(\\S+)\\s?")

Another way is to replace the whole header with the captured ID:

seqkit replace -p ".+\|([^\|]+)\|.+" -r "\$1"
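After renaming, it may be worth checking that no two records ended up with the same shortened header, since downstream tools often expect unique IDs. The sketch below counts headers and repeated headers in a FASTA stream; `check_headers` is a hypothetical helper name, and the sample input is abbreviated from the question.

```shell
# Hypothetical helper: count header lines and how many of them repeat an
# earlier header. Reads the files given as arguments, or stdin if none.
check_headers() {
    awk '/^>/ { n++; if (seen[$0]++) dup++ }
         END  { printf "headers=%d duplicates=%d\n", n, dup }' "$@"
}

# Demonstration on two already-renamed records.
printf '%s\n' \
    '>EPI_ISL_1034736' \
    'CCAGGTAACAAACC' \
    '>EPI_ISL_1034763' \
    'ATTAAAGGTTTATA' | check_headers
```

Running the same helper on the original and the renamed file should give the same `headers=` count; a non-zero `duplicates=` value on the renamed file would suggest that two records shared an ID.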