5.8 years ago
MB
Hi, I am trying to extract organisms' names from the headers in a multi-fasta file named input.fa shown below:
>KZR5864_Org_name_nam_strain.11
GHTKKLACWQRTTAAFFGYYWOPPEEDSSSSLKKDDIIPFTQWENMAATGGFDMLLAAPP
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNCCWWETRQKLLSSNNNMTIPAHGFSSLLKANCDSM
>SMAR_08120_Other_org_name_bla
AGTHHKKLAMNCWTQEREYPPILLSSDFMNCCVTTQQLAK
What I want is to obtain the organism name from each header. I have tried the following sed command, but I am unable to check for the alphanumerics, so I am also getting the digits after the first underscore, as in the third header.
sed -eT -e 's|_|&\n|;D' input.fa > out.txt
Expected results:
Org_name_nam_strain.11
Org_other_name_bla_bla
Other_org_name_bla
Please tell me how to obtain org names only. Thanks!
with sed:
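The command posted with this answer is not preserved in the thread. As a stand-in, here is a minimal sketch that reproduces the expected output for the three example headers. It assumes the ID is the first underscore-delimited field, optionally followed by one all-digit field (as in SMAR_08120); it is not necessarily what the answerer wrote, and `-E` assumes a sed with extended-regex support (GNU or BSD).

```shell
# Recreate the example file from the question (sample data only).
cat > input.fa <<'EOF'
>KZR5864_Org_name_nam_strain.11
GHTKKLACWQRTTAAFFGYYWOPPEEDSSSSLKKDDIIPFTQWENMAATGGFDMLLAAPP
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNCCWWETRQKLLSSNNNMTIPAHGFSSLLKANCDSM
>SMAR_08120_Other_org_name_bla
AGTHHKKLAMNCWTQEREYPPILLSSDFMNCCVTTQQLAK
EOF

# -n suppresses default printing; the trailing p prints only lines where
# the substitution succeeded, i.e. header lines starting with '>'.
# ^>[^_]*_      strips '>' plus the first field and its underscore
# ([0-9]+_)?    also strips a second field if it consists only of digits
sed -n -E 's/^>[^_]*_([0-9]+_)?//p' input.fa > out.txt
```

Sequence lines never match `^>`, so they are dropped without a separate grep step.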
with awk:
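The awk command from this answer is likewise missing from the thread. A hedged sketch, under the assumption that the organism name begins at field 2 when splitting on underscores, or at field 3 when field 2 is purely numeric; the variable names `start` and `name` are illustrative only.

```shell
# Recreate the example file from the question (sample data only).
cat > input.fa <<'EOF'
>KZR5864_Org_name_nam_strain.11
GHTKKLACWQRTTAAFFGYYWOPPEEDSSSSLKKDDIIPFTQWENMAATGGFDMLLAAPP
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNCCWWETRQKLLSSNNNMTIPAHGFSSLLKANCDSM
>SMAR_08120_Other_org_name_bla
AGTHHKKLAMNCWTQEREYPPILLSSDFMNCCVTTQQLAK
EOF

# Split header lines on '_' and rejoin everything after the ID fields.
awk -F'_' '/^>/ {
    # Skip field 1 (the ID); also skip field 2 if it is all digits.
    start = ($2 ~ /^[0-9]+$/) ? 3 : 2
    name = $start
    for (i = start + 1; i <= NF; i++) name = name "_" $i
    print name
}' input.fa > out.txt
```

Rejoining the fields by hand keeps the underscores inside the organism name intact.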
Are these all of the possible formats for your FASTA headers? I ask because the regex from sed, Perl, or awk won't really matter if there is more variety than what you show in your example FASTA headers. The regex in any of these programs has to be tailored exactly to the input, which, if I had to guess, is more diverse than what you show here.

Yes, these are all the possible formats for FASTA headers in the input file.
This works with your example. I can't guarantee it will work with all of the lines.
Thanks, it worked! It gives all the organism names.
Or:
Grep picks lines starting with '>' and sed removes everything before "[0-9]_" (including match).
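The description above corresponds to a pipeline along these lines. This is a reconstruction from the sentence, not the poster's exact command; the exact regex is an assumption.

```shell
# Recreate the example file from the question (sample data only).
cat > input.fa <<'EOF'
>KZR5864_Org_name_nam_strain.11
GHTKKLACWQRTTAAFFGYYWOPPEEDSSSSLKKDDIIPFTQWENMAATGGFDMLLAAPP
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNCCWWETRQKLLSSNNNMTIPAHGFSSLLKANCDSM
>SMAR_08120_Other_org_name_bla
AGTHHKKLAMNCWTQEREYPPILLSSDFMNCCVTTQQLAK
EOF

# grep keeps only header lines; sed then deletes everything up to and
# including the last "digit followed by underscore" on each line.
grep '^>' input.fa | sed 's/.*[0-9]_//' > out.txt
```

Because `.*` is greedy, the match extends to the last digit-underscore pair in the header, so an organism name that itself contains something like `K12_` would be cut short.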
It worked too! Thanks!