Hello All,
I would like to sort FASTA header lines (annotations). Below is an example of my data, which is stored in a .txt file:
>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96598.1 ATP synthase F0 subunit 8 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96599.1 ATP synthase F0 subunit 9 (mitochondrion) [Chaetosphaeridium globosum]
I would like to get the data as below: just the accession number and protein name, preferably in table format, with everything after the protein name removed.
example:
>AHF21055.1 ribosomal protein S4
>AAM96597.1 ATP synthase F0 subunit 6
>AAM96598.1 ATP synthase F0 subunit 8
>AAM96599.1 ATP synthase F0 subunit 9
Thank you in advance!
Assuming the (mitochondrion) is always there, this is what I can think of off the top of my head:
cut -f1 -d'(' header.txt | sort
There will be a trailing space at the end of each line, which can be removed by
sed 's/ *$//'
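Putting the two steps above together, here is a rough sketch rather than a definitive solution. It assumes every header contains a "(" right after the protein name, writes a small sample header.txt from the lines in the question, and uses the output names cleaned.txt and table.tsv, which are my own choices. The last step drops the leading ">" and splits on the first space to get a two-column, tab-separated table, which is only one reading of "table format".

```shell
# Sample input copied from the question; header.txt is the file name used in the answer.
printf '%s\n' \
  '>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]' \
  '>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]' \
  '>AAM96598.1 ATP synthase F0 subunit 8 (mitochondrion) [Chaetosphaeridium globosum]' \
  '>AAM96599.1 ATP synthase F0 subunit 9 (mitochondrion) [Chaetosphaeridium globosum]' \
  > header.txt

# Keep the text before the first "(", strip trailing spaces, then sort.
# cleaned.txt is an assumed output name.
cut -f1 -d'(' header.txt | sed 's/ *$//' | sort > cleaned.txt

# Optional table: accession (without ">") and protein name, separated by a tab.
# The split happens at the first space of each line.
awk '{ i = index($0, " "); print substr($0, 2, i - 2) "\t" substr($0, i + 1) }' \
  cleaned.txt > table.tsv
```

Note that cut splits at the first "(" on the line, so a protein name that itself contains parentheses would be truncated early.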
Thank you, Eric Lim, for your reply!
What have you tried?
PS: Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
Sure Ram, will do that from next time. Thanks a lot! I am kinda new to this forum.
Do all of your entries follow that format? Will there be some where the string (mitochondrion) is not there?
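If some headers do turn out to lack the (mitochondrion) part, one possible workaround is to strip from the end of the line instead: remove the trailing [organism] block and, when present, a single parenthesised term directly before it. This is only a sketch under that assumption. The function name strip_annotation and the output file cleaned_any.txt are made up for illustration, and the second sample header is an invented example, not real data.

```shell
# Hypothetical helper: removes an optional "(...)" term followed by the
# trailing "[organism]" block from each line read on standard input.
strip_annotation() {
  sed -E 's/ *(\([^()]*\))? *\[[^][]*\] *$//'
}

# First header is from the question; the second is a made-up header
# without the "(mitochondrion)" part.
printf '%s\n' \
  '>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]' \
  '>EXAMPLE1.1 hypothetical protein [Some organism]' \
  | strip_annotation | sort > cleaned_any.txt
```

The caveat is that a protein name ending in its own parenthesised term just before the organism would lose that term too, so it is worth checking a few lines of the output by eye.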