Question

Adding sequence length to its ID

7.0 years ago

Kenny ▴ 30

Hi all,

I have a scaffold FASTA file named "oenopla_scaffold_112117.fa" that contains 192947 sequences.

The scaffold IDs are:

grep ">" oenopla_scaffold_112117.fa | head -5
>scaffold_0
>scaffold_1
>scaffold_2
>scaffold_3
>scaffold_4

And the scaffold lengths are:

cat oenopla_scaffold_112117.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' | head -6

scaffold_0  16608
scaffold_1  14918
scaffold_2  14554
scaffold_3  14024
scaffold_4  13894

I want to add the sequence length to each ID, so the desired output would look like:

>scaffold_0_16608
>scaffold_1_14918
>scaffold_2_14554
>scaffold_3_14024
>scaffold_4_13894
...
>scaffold_192946_1500

How can I do this?

Best,

Kenny

score 2 · Accepted Answer · 2017-12-12

7.0 years ago

Pierre Lindenbaum 164k

Linearize the FASTA so each record sits on one line, then convert it back to FASTA with each sequence's length appended to its ID, using awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk -F '\t' '{printf("%s_%d\n%s\n",$1,length($2),$2);}'
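To see what the two awk steps do, here is a small sketch that runs the same pipeline on a toy file. The file names `toy.fa` and `toy.renamed.fa` and the two short records are made up for illustration; the awk programs are the ones from the command above.

```shell
# Hypothetical two-record test file; the first sequence is wrapped over two lines.
printf '>scaffold_0\nACGT\nAC\n>scaffold_1\nGGG\n' > toy.fa

# Step 1: join each record into one line of the form "header<TAB>sequence".
# Step 2: split on the tab, append "_<length>" to the header, and print
#         the header and the sequence on separate lines again.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' toy.fa |
  awk -F '\t' '{printf("%s_%d\n%s\n",$1,length($2),$2);}' > toy.renamed.fa

# Show the rewritten headers.
grep '>' toy.renamed.fa
```

On the toy file this should print `>scaffold_0_6` and `>scaffold_1_3`. Note that each sequence comes out on a single unwrapped line, and that the second step assumes the header lines contain no tab characters.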