Inserting a string after placing the last underscore in a header if a fasta file
2
0
Entering edit mode
6.1 years ago
timmers ▴ 30

I'm having trouble with my syntax in trying to remove the last '_' in the header to replace with the string ':size=' and then add a ';' after the number. Original file looks like this:

 >ART01B_100_M7_ID100005_1
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325_189
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005_46
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

and am trying to make it look like the following:

>ART01B_100_M7_ID100005;size=1;
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325;size=189;
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005;size=46;
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

Thanks!

sequence • 1.5k views
ADD COMMENT
0
Entering edit mode

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

I'm having trouble with my syntax

Please show us what you tried, we'd be happy to put you on the right track.

ADD REPLY
0
Entering edit mode

I seem to keep removing everything in the line and just replacing with ';size=. These are three of the attempts that do this: sed '/>/ s/(.)_.$/;size=/g' , sed '/>/ s/(.)_/;size=/g', sed '/>/ s/(.)_./;size=/g'. I'm missing something to indicate not to remove everything before that last _ but to replace it with ;size=. I know how to add just the ';' at the end which is sed 's/$/;/' so if the first replacement part is fixed then I can add that sed function to finish up the header but if it's possible to do all at once, great.

ADD REPLY
2
Entering edit mode
6.1 years ago
vkkodali_ncbi ★ 3.8k

Try this:

vamsi@cistron:~$ cat temp.txt
 >ART01B_100_M7_ID100005_1
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325_189
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005_46
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

vamsi@cistron:~$ sed 's/\(>.*\)_\([0-9]*\)/\1;size=\2/g' temp.txt
 >ART01B_100_M7_ID100005;size=1
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325;size=189
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005;size=46
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

By default, sed is 'greedy' so \1 will always include everything until the last underscore.

ADD COMMENT
0
Entering edit mode

Yes that works thanks!

ADD REPLY
1
Entering edit mode

Sorry, did not notice the trailing semi-colon; if you change the sed statement to something like this, you can add it with a single command:

sed 's/\(>.*\)_\([0-9]*\)/\1;size=\2;/g'
ADD REPLY
0
Entering edit mode

I'd recommend --extended-regexp while working with backreferences, it removes the need for backslashes in the match expression.

sed -r 's/(>.*)_([0-9]*)/\1;size=\2;/g' temp.txt
ADD REPLY
0
Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLY

Login before adding your answer.

Traffic: 2488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6