Question

need code for sorting fasta header

0

6.1 years ago

divyaranib.10 • 0

Hello All,

I would like to sort the FASTA header lines (annotations). Below is an example of my data, which is in a .txt file:

>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96598.1 ATP synthase F0 subunit 8 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96599.1 ATP synthase F0 subunit 9 (mitochondrion) [Chaetosphaeridium globosum]

I would like to get the data as below: just the accession number and protein name, preferably in table format, with everything after the protein name removed.

example:

>AHF21055.1     ribosomal protein S4
>AAM96597.1     ATP synthase F0 subunit 6
>AAM96598.1     ATP synthase F0 subunit 8
>AAM96599.1     ATP synthase F0 subunit 9

Thank you in advance!

fasta sequence • 1.3k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 6.1 years ago by divyaranib.10 • 0

1

Assuming the (mitochondrion) is always there, this is what I can think of off the top of my head: cut -f1 -d'(' header.txt | sort. This leaves a trailing space at the end of each line, which can be removed with sed 's/ *$//'.
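The two commands in this reply can be chained into one pipeline. The following is only a minimal sketch under the same assumption (every header contains a "(" right after the protein name); `trim_headers` is a hypothetical helper name, not something from the thread.

```shell
# trim_headers is a hypothetical helper name used for illustration.
# It reads header lines from the given file(s), or from stdin when none are given.
trim_headers() {
  # 1. keep only the text before the first "("
  # 2. strip the trailing space(s) that cut leaves behind
  # 3. sort the trimmed headers
  cut -d'(' -f1 "$@" | sed 's/ *$//' | sort
}

# Example run on two of the headers from the question
trim_headers <<'EOF'
>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
EOF
```

With a file, the call would look like `trim_headers header.txt > sorted_headers.txt`, where both file names are placeholders.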

ADD REPLY • link 6.1 years ago by Eric Lim ★ 2.2k

0

thank you Eric Lim for your reply!

ADD REPLY • link 6.1 years ago by divyaranib.10 • 0

0

What have you tried?

PS: Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

ADD REPLY • link 6.1 years ago by Ram 44k

0

Sure Ram will do that from next time. Thanks a lot! I am kinda new to this forum

ADD REPLY • link 6.1 years ago by divyaranib.10 • 0

0

Do all of your entries follow that format? Will there be some where the string (mitochondrion) is not there?

ADD REPLY • link 6.1 years ago by Joe 21k

1

6.1 years ago

Ram 44k

In my experience, FASTA headers consist of two parts: the ID (everything up to the first whitespace) and the description. You can use a tool like bioawk to extract just the identifier and then sort the output, or you can use any combination of command-line utilities, such as grep -o, cut, or sed, much like in Eric Lim's comment.
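As a rough illustration of the ID/description split, here is a sketch that uses plain awk in place of bioawk (a substitution made only so that nothing extra needs to be installed). `split_headers` is a hypothetical helper name, and the sketch assumes the ID is the first whitespace-delimited token of each header line.

```shell
# split_headers is a hypothetical helper; plain awk stands in for bioawk here.
# Prints "ID<TAB>description" for every header line in the given file(s) or stdin.
split_headers() {
  awk '/^>/ {
    id = $1
    sub(/^>/, "", id)                 # ID: first token, without the leading ">"
    desc = $0
    sub(/^[^ \t]+[ \t]*/, "", desc)   # description: the rest of the line
    print id "\t" desc
  }' "$@"
}

# IDs only, sorted
split_headers <<'EOF' | cut -f1 | sort
>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
EOF
```

Sequence lines are skipped because only lines starting with ">" are matched, so the same helper should also work on a full FASTA file rather than a headers-only text file.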

ADD COMMENT • link 6.1 years ago by Ram 44k

score 1 · Accepted Answer · 2018-10-24

1

6.1 years ago

n,n ▴ 370

This gives me the exact output you want as long as (mitochondrion) is present in all lines:

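# Drop blank lines, keep the text before the first "(", then put a tab after the accession.
# Assumes GNU sed (\t and \n in the replacement); s/$/\n/ adds a blank line after each header.
sed '/^[[:space:]]*$/d' old_fasta_headers | cut -d'(' -f1 | sed 's/\(\.[[:digit:]]*\) /\1\t/g ; s/$/\n/g' \
> new_fasta_headers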

Hope this helps.
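If some headers lack the (mitochondrion) part, a variant along the following lines might be more forgiving. It is only a sketch under stated assumptions: each header ends with an "[Organism]" field, optionally preceded by one parenthesised field, and `headers_to_table` is a hypothetical helper name. A protein name that itself ends in parentheses would be truncated by it.

```shell
# headers_to_table is a hypothetical helper name used for illustration.
# Prints "accession<TAB>protein name" for each header line in the file(s) or stdin.
headers_to_table() {
  awk '/^>/ {
    line = $0
    sub(/[ \t]*\[[^]]*\][ \t]*$/, "", line)    # drop the trailing "[Organism]"
    sub(/[ \t]*\([^()]*\)[ \t]*$/, "", line)   # drop a trailing "(location)" if present
    sub(/[ \t]+/, "\t", line)                  # first run of blanks becomes one tab
    print line
  }' "$@"
}

# Example: one header with "(mitochondrion)" and one made-up header without it
headers_to_table <<'EOF'
>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
>XXX00000.1 example protein [Example organism]
EOF
```

Using awk here avoids relying on `\t` in a sed replacement, which is not portable across sed implementations.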

ADD COMMENT • link 6.1 years ago by n,n ▴ 370

0

thanks a lot mike!! it solved my problem!

ADD REPLY • link 6.1 years ago by divyaranib.10 • 0