Editing header of a fasta file
7.4 years ago
DVR ▴ 30

Hello everybody, I am not an expert with sed, and I am sure that someone will do this work faster and better than me.

I would like to edit multiple FASTA headers from this format:

>M01380:50:000000000-AV1DH:1:1101:16094:3001 1:N:0:M636:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|

to this one:

>M636

As you can see, "M636" is embedded in the larger header.

Thank you for always helping everybody!

D.

header edition fasta • 7.6k views

Sure. But did you try anything?

Just a comment:

someone will do this work faster and better than me

There will always be people ahead of us. But this shouldn't hinder our learning :)

Hello @venu

Sure, I am trying with sed (since awk seems to be more complicated). This post from @noirot.celine gave me some ideas of what to do (A: Renaming Entries In A Fasta File), but I am having a hard time figuring out the regular expressions.

I understand your concern. I will keep trying... :)

Please use ADD COMMENT or ADD REPLY to respond to previous posts; that way the thread stays logically structured and easy to follow. I have now moved your post, but as you can see the result is not optimal. Adding an answer should only be used for providing a solution to the question asked.

Thank you all for your solutions,

Seidel's proposal worked beautifully!

Thanks also, @novice. Maybe I did not explain myself well, but the "M" may change to another letter or not be present at all.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

7.4 years ago
seidel 11k

If your field of interest is always in the same position, you can try awk with the split() function and do something like the following:

awk '{split($0,a,":"); if(a[10]) print ">"a[10]; else print; }' yourfile.fasta > newfile.fasta

This translates to: split the input line on colons and put the pieces into array a; if the 10th element, a[10], is non-empty, print ">" followed by that element; otherwise print the input line unchanged. Sequence lines contain no colons, so they pass through untouched.
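To see the effect on a concrete file, here is a small sketch of a variant of the same idea. It is not the command above: it uses awk's -F option instead of split() and only rewrites lines that start with ">". The file names, the sequences, and the second record (with the made-up name K12) are assumptions for illustration; only the first header comes from the question.

```shell
# Build a tiny two-record FASTA file. Sequences and the second header
# are invented for this demo; only the first header is from the question.
cat > demo.fasta <<'EOF'
>M01380:50:000000000-AV1DH:1:1101:16094:3001 1:N:0:M636:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|
ACGTACGTAC
>M01380:50:000000000-AV1DH:1:1101:15000:3002 1:N:0:K12:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|
TTGACCAGTA
EOF

# -F: makes awk split every line on colons, so $10 is the 10th field.
# Header lines (starting with ">") are replaced by ">" plus that field;
# all other lines are printed unchanged.
awk -F: '/^>/ { print ">" $10; next } { print }' demo.fasta > demo_renamed.fasta

cat demo_renamed.fasta
```

Like the original command, this assumes the name is always the 10th colon-separated field of the header; if the number of colons before it varies, the wrong field will be picked.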

Marked as accepted because OP didn't...

7.4 years ago
novice ★ 1.1k

I think the answer above is safe and correct. Here's a sed solution that is less safe (we don't know what is changing between the headers):

$ sed 's/>.*:\(M[0-9]*\):.*/>\1/' old.fasta > new.fasta

Edit: excluded colons from headers.
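Since the OP mentions that the "M" may be a different letter or missing, a position-based pattern may be more forgiving than matching M[0-9]*. The following is only a sketch under one assumption: the name is always the fourth colon-separated item after the first space in the header, as in the example. The file names, sequences, and the names K12 and 77 are made up for the demo.

```shell
# Demo input: three records whose names differ in form (M636, K12, 77).
# Only the first header is from the question; the rest is invented.
cat > demo2.fasta <<'EOF'
>M01380:50:000000000-AV1DH:1:1101:16094:3001 1:N:0:M636:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|
ACGTACGTAC
>M01380:50:000000000-AV1DH:1:1101:15000:3002 1:N:0:K12:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|
TTGACCAGTA
>M01380:50:000000000-AV1DH:1:1101:14000:3003 1:N:0:77:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|
GGCATTACGA
EOF

# ^>[^ ]*          the read ID, up to the first space
# [^:]*: (x3)      skip three colon-terminated items ("1:", "N:", "0:")
# \([^:]*\)        capture the next item, whatever characters it contains
# :.*              discard the rest of the header
# Lines that do not match (the sequences) are left as they are.
sed 's/^>[^ ]* [^:]*:[^:]*:[^:]*:\([^:]*\):.*/>\1/' demo2.fasta > demo2_renamed.fasta

cat demo2_renamed.fasta
```

Because the pattern is anchored at the start of the line and counts fields instead of looking for a particular letter, it makes no assumption about what the name looks like, only about where it sits.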

7.4 years ago
5heikki 11k

cut -f1 -d ":" in.fa > out.fa

Edit: never mind, I thought you wanted the first "field".
