How to edit the fasta headers in a multiline fasta file?
2.4 years ago
pinn ▴ 210

Hi,

I have thousands of sequences in a FASTA file. I'd like to delete the underscore and number (_1, _2, _34297, ...) at the end of each FASTA header.

Original file

>XP_034398789.1_1
>XP_034398430.1_2
....
....
....
>XP_034381508.1_34297
>XP_034419373.1_34330
>XP_034419129.1_34363
>XP_034385161.1_38667

Expected output

>XP_034398789.1
>XP_034398430.1
....
....
....
>XP_034381508.1
>XP_034419373.1
>XP_034419129.1

Using cut, I tried this on sample data, but it deletes the ">XP_" prefix. What would be the cut command for deleting the characters after the second underscore, e.g. the _1 in XP_034398789.1_1?

 cut -f2 -d'_' TEXT.fa.fa | sed '15~20s/^/>/'

 034419421.1
 034380977.1
 034381532.1

cut -d_ -f1,2 TEXT.fa.fa

    >XP_034398789.1
    >XP_034398430.1
    ....
    ....
    ....
    >XP_034381508.1
    >XP_034419373.1
    >XP_034419129.1
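
The two attempts differ only in which fields cut keeps: -f2 keeps just the text between the first and second underscore, while -f1,2 keeps the first two fields and rejoins them with the delimiter. A minimal sketch on a made-up sample (the file name sample.fa and the sequence line are hypothetical, not from the original data):

```shell
# Hypothetical sample file; the sequence line is invented for illustration.
printf '>XP_034398789.1_1\nMKTAYIAK\n>XP_034398430.1_2\n' > sample.fa

# -f2 keeps only the second underscore-delimited field, so ">XP" is lost.
cut -d_ -f2 sample.fa

# -f1,2 keeps fields 1 and 2 and rejoins them with "_".
# Lines that contain no "_" (the sequence lines) pass through unchanged.
cut -d_ -f1,2 sample.fa
```

This relies on sequence lines never containing an underscore, which should hold for ordinary protein or nucleotide FASTA.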
gene genome protein • 828 views

There are plenty of fasta-header-editing posts on the forum (I'm sure you would have seen a few in the years you've been here), and "delete everything after second underscore" will produce a ton of Google results. Did you try searching anywhere before creating a new post?

$ awk -F "_" '/^>/{print $1"_"$2};!/>/' test.fa
$ sed -r '/^>/ s/_\w+//2' test.fa
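
Both one-liners touch only header lines. The awk version splits on "_" and prints the first two fields; the sed version deletes the second match of `_\w+` (the trailing `2` selects the second occurrence, and `\w` is a GNU extension). Both assume the accession itself contains exactly one underscore. A sketch of a variant that anchors on the end of the line instead, assuming the suffix is always an underscore followed by digits (file names and sequence lines below are made up):

```shell
# Hypothetical sample file; the sequence lines are invented for illustration.
printf '>XP_034398789.1_1\nMKTAYIAK\n>XP_034381508.1_34297\nMSTNPKPQ\n' > test_sample.fa

# On header lines only, remove a trailing "_<digits>"; write to a new file
# so the original stays untouched.
sed -E '/^>/ s/_[0-9]+$//' test_sample.fa > test_sample.renamed.fa

cat test_sample.renamed.fa
```

Because the pattern is tied to the end of the header, it does not depend on how many underscores appear earlier in the ID.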
2.4 years ago

A seqkit answer for posterity.

seqkit replace -p "_\d+$" file.fasta
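
Whichever tool is used, stripping the suffix could in principle leave two records with the same ID if one accession appeared with different suffixes. A quick sanity check, sketched with plain shell tools on a made-up sample (dup_sample.fa is hypothetical and deliberately contains such a collision):

```shell
# Hypothetical sample in which two headers share one accession.
printf '>XP_034398789.1_1\nMKTAYIAK\n>XP_034398789.1_2\nMSTNPKPQ\n' > dup_sample.fa

# Strip the trailing "_<digits>" from header lines, then list every header
# that now occurs more than once. Empty output means no collisions.
sed -E '/^>/ s/_[0-9]+$//' dup_sample.fa | grep '^>' | sort | uniq -d
```

On real data the same pipeline can be pointed at the renamed file directly, skipping the sed step.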