Adding sequence length to its ID
7.0 years ago
Kenny ▴ 30

Hi all,

I have a scaffold FASTA file named "oenopla_scaffold_112117.fa" that contains 192947 sequences.

The IDs of the scaffolds are:

grep ">" oenopla_scaffold_112117.fa | head -5
>scaffold_0
>scaffold_1
>scaffold_2
>scaffold_3
>scaffold_4

And the lengths of the scaffolds are:

awk '/^>/ {if (NR>1) print c; c=0; printf "%s\t", substr($0,2,100); next} {c+=length($0)} END {print c}' oenopla_scaffold_112117.fa | head -5

scaffold_0  16608
scaffold_1  14918
scaffold_2  14554
scaffold_3  14024
scaffold_4  13894

What I want to do is add the sequence length to each ID, so the desired output would look like:

>scaffold_0_16608
>scaffold_1_14918
>scaffold_2_14554
>scaffold_3_14024
>scaffold_4_13894
...
>scaffold_192946_1500

How can I do this?

Best,

Kenny

sequence • 1.2k views
7.0 years ago

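Linearize the FASTA so that each record sits on a single line (header, tab, sequence), then convert it back to FASTA while appending the sequence length to the header, all with awk: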

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk -F '\t' '{printf("%s_%d\n%s\n",$1,length($2),$2);}'
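
To see what the two awk steps do, here is a small sketch that runs the same pipeline on a toy file. The file names `toy.fa` and `toy.renamed.fa` and the two short records are made up for illustration; they are not from the original data.

```shell
# Toy input (hypothetical data): two records, the first wrapped over two lines.
printf '>scaffold_0\nACGT\nAC\n>scaffold_1\nGGG\n' > toy.fa

# Step 1: linearize, giving one record per line as "header<TAB>sequence".
# Step 2: append "_<length>" to the header and print the record back as FASTA.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' toy.fa |
  awk -F '\t' '{printf("%s_%d\n%s\n",$1,length($2),$2);}' > toy.renamed.fa

cat toy.renamed.fa
# >scaffold_0_6
# ACGTAC
# >scaffold_1_3
# GGG
```

Note that each sequence comes back on a single line, so any line wrapping in the input is not preserved. The pipeline also assumes that headers contain no tab characters, since the tab is used as the field separator in the second step.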

It works perfectly. Thank you Pierre!
