Question

How to extract organism names from the headers in a multi-FASTA file?

1

5.8 years ago

MB ▴ 50

Hi, I am trying to extract organisms' names from the headers in a multi-fasta file named input.fa shown below:

>KZR5864_Org_name_nam_strain.11
GHTKKLACWQRTTAAFFGYYWOPPEEDSSSSLKKDDIIPFTQWENMAATGGFDMLLAAPP
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNCCWWETRQKLLSSNNNMTIPAHGFSSLLKANCDSM
>SMAR_08120_Other_org_name_bla
AGTHHKKLAMNCWTQEREYPPILLSSDFMNCCVTTQQLAK

What I want to obtain is the organism name in each header. I have tried the following sed command, but I am unable to make it check for alphanumeric characters, so I also get the digits after the first underscore, as in the third header.

sed -eT -e 's|_|&\n|;D' input.fa > out.txt

Expected results:

Org_name_nam_strain.11
Org_other_name_bla_bla
Other_org_name_bla

Please tell me how to obtain org names only. Thanks!

Fasta Regex Sed Header • 2.9k views

ADD COMMENT • link updated 5.8 years ago by finswimmer 16k • written 5.8 years ago by MB ▴ 50

2

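with sed:

$ sed -rn 's/.*(org\.*)/\1/pgi' test.txt

Org_name_nam_strain.11
Org_other_name_bla_bla
org_name_bla

This one keys on the literal text "org" (case-insensitive), so the third header loses its leading "Other_". Matching on the digit and underscore instead gives the full name:

$ sed -n '/>/ s/.*[0-9]_//p' test.txt

Org_name_nam_strain.11
Org_other_name_bla_bla
Other_org_name_bla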

with awk:

$ awk '/>/ {sub(".*[0-9]_","",$0);print}' test.txt

Org_name_nam_strain.11
Org_other_name_bla_bla
Other_org_name_bla

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Are these all of the possible formats for your FASTA headers? I ask because the choice between sed, Perl, and awk won't really matter if your headers are more varied than the examples shown. Whichever tool you use, the regex has to be tailored exactly to the input, which I would guess is more diverse than what you show here.
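One way to check this before trusting any of the regexes is to list the headers that do not fit the assumed pattern. This is only a minimal sketch: it assumes every header line starts with '>' and that the organism name always follows a digit plus an underscore (the "[0-9]_" pattern used elsewhere in this thread). The helper name unmatched_headers is made up for illustration.

```shell
# Hypothetical helper: print every FASTA header that contains no digit
# followed by an underscore, i.e. headers the "[0-9]_" patterns cannot handle.
unmatched_headers() {
    # First grep keeps header lines only; second grep drops the ones that fit.
    grep '^>' "$1" | grep -v '[0-9]_'
}

# Usage (prints nothing when every header fits the pattern):
# unmatched_headers input.fa
```

If this prints anything, those headers need their own rule. Note that grep -v exits with a non-zero status when nothing is printed, which matters if you call this under set -e.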

ADD REPLY • link 5.8 years ago by jean.elbers ★ 1.7k

0

Yes, these are all the possible formats for fasta headers in the input file.

ADD REPLY • link 5.8 years ago by MB ▴ 50

1

This works with your example. I can't guarantee it will work with all of the lines.

grep ">" input.fa |perl -pe "s/>\w+\.\d+\_(.+)/\1/"|perl -pe "s/>[A-Za-z0-9]+_(.+)/\1/"|perl -pe "s/[0-9]+_(.+)/\1/"

ADD REPLY • link 5.8 years ago by jean.elbers ★ 1.7k

0

Thanks, it worked! It gives all the organism names.

ADD REPLY • link 5.8 years ago by MB ▴ 50

1

Or:

grep '>' input.fa | sed 's/.*[0-9]_//'

grep picks the lines containing '>' (the headers), and sed removes everything up to and including the last "[0-9]_" match, because ".*" is greedy.
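To see what the sed part of the one-liner above does with more than one digit-underscore pair, here is a small sketch. The wrapper name strip_prefix and the second header (>ABC123_Org_sp_2_strain) are invented for illustration and do not come from the original file.

```shell
# Hypothetical wrapper around the sed expression from the one-liner above.
# Reads header lines on stdin and strips everything up to the LAST "digit_".
strip_prefix() {
    sed 's/.*[0-9]_//'
}

# Third header from the question: the only "digit_" is the "0_" ending "08120_".
printf '%s\n' '>SMAR_08120_Other_org_name_bla' | strip_prefix
# prints: Other_org_name_bla

# Made-up header whose organism name itself contains "digit_":
printf '%s\n' '>ABC123_Org_sp_2_strain' | strip_prefix
# prints: strain
```

So this is fine for the three header formats shown, but it would cut too much if an organism name ever contained a digit followed by an underscore.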

ADD REPLY • link 5.8 years ago by ahaswer ▴ 150

0

It worked too! Thanks!

ADD REPLY • link 5.8 years ago by MB ▴ 50

score 1 · Answer 1 · 2019-02-26

1

5.8 years ago

finswimmer 16k

So the organism name is everything that follows a digit and an underscore.

$ grep -oP '(?<=[0-9]_).*' input.fa

-o makes grep print only the matched part instead of the whole line
-P enables Perl-compatible regular expressions, which are needed for the positive lookbehind
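-P is a GNU grep feature and may be missing from other grep implementations. If it is not available, roughly the same thing can be done with awk. This is a sketch rather than a drop-in replacement: the function name names_after_first_match is made up, and it assumes header lines start with '>'. Like the lookbehind version, it keeps everything after the first digit-underscore pair in the header.

```shell
# Hypothetical helper: for each header line, print what follows the FIRST
# "digit_" pair. match() sets RSTART/RLENGTH to the leftmost match.
names_after_first_match() {
    awk '/^>/ && match($0, /[0-9]_/) { print substr($0, RSTART + RLENGTH) }' "$@"
}

# Usage:
# names_after_first_match input.fa
```

Headers without any digit-underscore pair are skipped silently, the same as with grep -o.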

ADD COMMENT • link 5.8 years ago by finswimmer 16k