I am relatively new to Linux, and I have read through this post: Fasta header trimming, but it does not quite solve my problem.
This is the format of the sequences in my file:
>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
.. followed by the amino acid sequence.
I would like the format to be:
>P48347
+ sequence
As you can see, there are multiple delimiters, and I'm struggling to extract the characters I want correctly.
So far, my code is:
$ cut -d ' ' -f 1 example.fasta | cut -d '|' -f 2 > out.fasta
Which outputs:
P48347
+ sequence
I considered using sed to add the ">" back, but this seems a bit messy. I have also tried awk, but I am confused by how to use it with multiple delimiters and fasta format.
How do I extract the unique identifier in the header (P48347), without losing the '>' at the beginning?
Thanks in advance.
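One possible approach, as a minimal sketch: let awk split only the header lines on '|' and put the '>' back itself, while passing sequence lines through untouched. This assumes every header follows the same `>db|ACCESSION|NAME ...` layout as the example above, so the accession is always the second '|'-separated field. The sample file written here is only for illustration, and its sequence line is a placeholder, not the real P48347 sequence. A sed equivalent is included since sed was mentioned; it assumes a sed that supports -E (GNU and BSD sed both do).

```shell
# Build a small sample file. The header is the one from the question;
# the sequence line is a placeholder, not the real protein sequence.
cat > example.fasta <<'EOF'
>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
ACDEFGHIKLMNPQRSTVWY
EOF

# awk: use '|' as the field separator.
# Header lines (starting with '>'): print '>' followed by field 2, then skip to the next line.
# Every other line (sequence): print unchanged.
awk -F'|' '/^>/ { print ">" $2; next } { print }' example.fasta > out.fasta

# sed alternative: on lines starting with '>', capture the text between the
# first and second '|' and rewrite the whole line as '>' plus that capture.
sed -E 's/^>[^|]*\|([^|]*)\|.*/>\1/' example.fasta > out_sed.fasta

cat out.fasta
```

Both commands only touch lines that begin with '>', so multi-line sequences are left alone. If some headers in the file do not contain two '|' characters, the awk version would print a bare '>' for them, while the sed version would leave them unchanged, so it is worth checking the header format first (for example with grep '^>' example.fasta | head).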
Thanks, this works perfectly!