Question

How to correct a wrong protein extraction from FASTA file?

1

Entering edit mode

4.3 years ago

A_heath ▴ 170

Hi all,

I have made a script where I can extract protein sequences in a fasta file from protein IDs.

However, I have this issue that I can't solve. Here is an example of my text file containing protein IDs:

Lactococcus_104
Lactococcus_105

Here is an example of my fasta file containing protein sequences:

[...]
>Lactococcus_104
MXXXXX
>Lactococcus_105
MXXXXX
[...]
>Lactococcus_1050
MXXXXX
>Lactococcus_1051
MXXXXX
>Lactococcus_1052
MXXXXX
[...]

I extract the protein sequences with this command that works perfectly:

IFS=$'\n'; for i in $(cat IDs.txt);do  line=$(grep -nr "$i" file.fasta); if [[ ! -z $line ]];then for j in $line;do lineNumber=$(echo $j | cut -d':' -f1); sed -n "$lineNumber p" file.fasta; awk -v nb=$lineNumber 'NR > nb {if ($0 ~ ">") exit; else print $0 }' file.fasta; done;fi;done > output.fasta

But the problem is that the script "extracts" all protein IDs that contains, in my example, "104" or "105" so it also "extracts" protein sequences that I don't want like Lactococcus_1051, Lactococcus_1052, etc. instead of only Lactococcus_105.

Is there a way that I can modify my command line to specify that I strictly want the extraction of the protein IDs in my text file?

If you have any suggestions, please let me know!

Thank you in advance for your help.

fasta bash • 782 views

ADD COMMENT • link updated 14 months ago by Ram 44k • written 4.3 years ago by A_heath ▴ 170

1

Entering edit mode

Can you add -w to your grep?

-w, --word-regexp
              Select  only  those lines containing matches that form whole words.

ADD REPLY • link 4.3 years ago by GenoMax 147k

0

Entering edit mode

Yes that's exactly what was missing on my command line! Thank you so much for your quick answer genomax!

ADD REPLY • link 4.3 years ago by A_heath ▴ 170

score 3 · Accepted Answer · 2020-08-13

3

Entering edit mode

4.3 years ago

Shalu Jhanwar ▴ 540

Have you tried grep with -w like below:

IFS=$'\n'; for i in $(cat IDs.txt);do  line=$(grep -nw "$i" file.fasta); if [[ ! -z $line ]];then for j in $line;do lineNumber=$(echo $j | cut -d':' -f1); sed -n "$lineNumber p" file.fasta; awk -v nb=$lineNumber 'NR > nb {if ($0 ~ ">") exit; else print $0 }' file.fasta; done;fi;done > output.fasta

ADD COMMENT • link 4.3 years ago by Shalu Jhanwar ▴ 540

0

Entering edit mode

Yes, as genomax replied, I tried -w and it is now working perfectly! Thank you for your answer Shalu Jhanwar!

ADD REPLY • link 4.3 years ago by A_heath ▴ 170