Naming of FastA sequences
10.5 years ago
Phil S. ▴ 700

Hi there,

I have an ordinary-looking FASTA file like this:

>Translation: 2..112 (direct), 37 amino acids
ADTAQEFISTAVFGTSMSAHHILGLKPVPRVWLFAI*

>Translation: 1482..1790 (direct), 103 amino acids
MKKYTEQAKLSVVEDYCSGSAGHREVAHRHGVNANVIRKWLPIYRDKPVAPLPAFVPLQP
MPKRQADEAVVIALSLGDKSITVKWPISDPDGCARFIRSLSQ*
 
>Translation: 1787..2122 (direct), 112 amino acids
MIRIDAIWLATEPMDMRAGTETALVRVVAVFGAAKPHCAYLFANRRANRMKVLVHDGVGI
WLAARRLNQGKFHWPGTHRGLEVGLDAEQLQALVLGLPWQRVGANGAITMI*

What I want to do is rename the sequences with a running number that is always 5 digits long. The three sequences above should then be named like this:

>orf00001 2..112 (direct), 37 amino acids
ADTAQEFISTAVFGTSMSAHHILGLKPVPRVWLFAI*


>orf00002 1482..1790 (direct), 103 amino acids
MKKYTEQAKLSVVEDYCSGSAGHREVAHRHGVNANVIRKWLPIYRDKPVAPLPAFVPLQP
MPKRQADEAVVIALSLGDKSITVKWPISDPDGCARFIRSLSQ*
 
>orf00003 1787..2122 (direct), 112 amino acids
MIRIDAIWLATEPMDMRAGTETALVRVVAVFGAAKPHCAYLFANRRANRMKVLVHDGVGI
WLAARRLNQGKFHWPGTHRGLEVGLDAEQLQALVLGLPWQRVGANGAITMI*

So the 5 digits are fixed, and I just need to count upwards each time I see a '>'. Unfortunately, I don't know how to pad the number to a fixed length of five digits.

Thanks for your help (once again ;) )

Best,

Phil

fasta bash python • 2.0k views
10.5 years ago

Something like the following should get you close:

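$ awk '
    # Header line: swap the leading ">Translation:" for a zero-padded label.
    # idx starts out empty, so ++idx yields 1 on the first header (orf00001).
    /^>/ { sub(/^>Translation:/, sprintf(">orf%05d", ++idx)) }
    # Print every line, relabeled headers and sequence lines alike.
    { print }
' mySeqs.fa > myRelabeledSeqs.fa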
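Since the question is also tagged python, here is a minimal sketch of the same idea in Python. The function name `relabel_fasta` and its parameters (`prefix`, `width`, `old_tag`) are made up for illustration, and it assumes every header starts with ">Translation:" as in the example above. The zero-padding comes from the format spec `0{width}d`, which is the Python counterpart of `%05d`.

```python
def relabel_fasta(lines, prefix="orf", width=5, old_tag="Translation:"):
    """Yield FASTA lines with each header relabeled as >orf00001, >orf00002, ...

    Hypothetical helper: the names and defaults are assumptions, not part
    of any library.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            rest = line[1:]
            # Drop the old tag if present; keep the rest of the description.
            if rest.startswith(old_tag):
                rest = rest[len(old_tag):]
            # {count:0{width}d} pads the counter with zeros to a fixed width.
            yield f">{prefix}{count:0{width}d} {rest.strip()}\n"
        else:
            # Sequence lines and blank lines pass through unchanged.
            yield line


example = [
    ">Translation: 2..112 (direct), 37 amino acids\n",
    "ADTAQEFISTAVFGTSMSAHHILGLKPVPRVWLFAI*\n",
]
print("".join(relabel_fasta(example)), end="")
# >orf00001 2..112 (direct), 37 amino acids
# ADTAQEFISTAVFGTSMSAHHILGLKPVPRVWLFAI*
```

To process a real file, pass an open file handle instead of the list (for example `relabel_fasta(open("mySeqs.fa"))`) and write the result to the output file with `writelines`.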

Thank you so much for the fast and correct answer. The only thing I had to adjust was a line break...
