How to correct a wrong protein extraction from FASTA file?
1
1
Entering edit mode
4.3 years ago
A_heath ▴ 170

Hi all,

I have made a script where I can extract protein sequences in a fasta file from protein IDs.

However, I have this issue that I can't solve. Here is an example of my text file containing protein IDs:

Lactococcus_104
Lactococcus_105

Here is an example of my fasta file containing protein sequences:

[...]
>Lactococcus_104
MXXXXX
>Lactococcus_105
MXXXXX
[...]
>Lactococcus_1050
MXXXXX
>Lactococcus_1051
MXXXXX
>Lactococcus_1052
MXXXXX
[...]

I extract the protein sequences with this command that works perfectly:

IFS=$'\n'; for i in $(cat IDs.txt);do  line=$(grep -nr "$i" file.fasta); if [[ ! -z $line ]];then for j in $line;do lineNumber=$(echo $j | cut -d':' -f1); sed -n "$lineNumber p" file.fasta; awk -v nb=$lineNumber 'NR > nb {if ($0 ~ ">") exit; else print $0 }' file.fasta; done;fi;done > output.fasta

But the problem is that the script "extracts" all protein IDs that contains, in my example, "104" or "105" so it also "extracts" protein sequences that I don't want like Lactococcus_1051, Lactococcus_1052, etc. instead of only Lactococcus_105.

Is there a way that I can modify my command line to specify that I strictly want the extraction of the protein IDs in my text file?

If you have any suggestions, please let me know!

Thank you in advance for your help.

fasta bash • 785 views
ADD COMMENT
1
Entering edit mode

Can you add -w to your grep?

-w, --word-regexp
              Select  only  those lines containing matches that form whole words.
ADD REPLY
0
Entering edit mode

Yes that's exactly what was missing on my command line! Thank you so much for your quick answer genomax!

ADD REPLY
3
Entering edit mode
4.3 years ago
Shalu Jhanwar ▴ 540

Have you tried grep with -w like below:

IFS=$'\n'; for i in $(cat IDs.txt);do  line=$(grep -nw "$i" file.fasta); if [[ ! -z $line ]];then for j in $line;do lineNumber=$(echo $j | cut -d':' -f1); sed -n "$lineNumber p" file.fasta; awk -v nb=$lineNumber 'NR > nb {if ($0 ~ ">") exit; else print $0 }' file.fasta; done;fi;done > output.fasta
ADD COMMENT
0
Entering edit mode

Yes, as genomax replied, I tried -w and it is now working perfectly! Thank you for your answer Shalu Jhanwar!

ADD REPLY

Login before adding your answer.

Traffic: 2090 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6