Hi everyone,
I have some protein sequences with a carriage return on every line, but my program needs one protein name and one sequence.
Could you help me remove these carriage returns?
Thanks,
Fuyou
If I'm understanding you correctly, and you have protein sequences like this:
>prtn1
DTENKRK
KDFLTSE
NSLPRIS
and you want
>prtn1
DTENKRKKDFLTSENSLPRIS
you could use something like this
awk 'BEGIN {ORS=""}{if ($1 ~ /^>/) print "\n"$1"\n"; else print}END{print "\n"}' <protein file>
ORS="" sets awk's output record separator to the empty string, so print no longer adds an end of line and the protein sequence is concatenated into one line. Checking for a line starting with ">" means the ID can be printed on its own line, by including "\n" newline characters. This can handle multiple sequences, but the output starts with a blank line, so if your file only has one sequence you might want to replace "\n"$1"\n" with $1"\n".
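If the leading blank line is a nuisance, a variant along these lines might avoid it. This is only a sketch under a few assumptions: the input is FASTA-style with header lines starting with ">", unwrap_fasta is a made-up function name, and a trailing \r is stripped in case the file has Windows line endings. Unlike the one-liner above, it keeps the whole header line rather than just $1.

```shell
# unwrap_fasta is a hypothetical helper name, not an existing tool.
# It joins wrapped sequence lines so each record is one header line
# followed by one sequence line, with no leading blank line.
unwrap_fasta() {
  awk '
    { sub(/\r$/, "") }                # drop a trailing carriage return, if any
    /^>/ {
      if (inseq) printf "\n"          # finish the previous sequence line
      print                           # header goes on its own line
      inseq = 0
      next
    }
    { printf "%s", $0; inseq = 1 }    # append sequence text without a newline
    END { if (inseq) printf "\n" }    # terminate the last sequence line
  ' "$@"
}
```

You would call it as unwrap_fasta proteins.fa > proteins_oneline.fa, where both file names are placeholders for your own files.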
What exactly are the formats (from/to)? Why don't you post small examples?