Hi all,
I have written a script that extracts protein sequences from a FASTA file, given a list of protein IDs.
However, I have this issue that I can't solve. Here is an example of my text file containing protein IDs:
Lactococcus_104
Lactococcus_105
Here is an example of my fasta file containing protein sequences:
[...]
>Lactococcus_104
MXXXXX
>Lactococcus_105
MXXXXX
[...]
>Lactococcus_1050
MXXXXX
>Lactococcus_1051
MXXXXX
>Lactococcus_1052
MXXXXX
[...]
I extract the protein sequences with this command that works perfectly:
IFS=$'\n'
for i in $(cat IDs.txt); do
  line=$(grep -nr "$i" file.fasta)
  if [[ ! -z $line ]]; then
    for j in $line; do
      lineNumber=$(echo $j | cut -d':' -f1)
      sed -n "$lineNumber p" file.fasta
      awk -v nb=$lineNumber 'NR > nb {if ($0 ~ ">") exit; else print $0 }' file.fasta
    done
  fi
done > output.fasta
The problem is that the script extracts every protein ID that contains one of my IDs as a substring (in my example, "104" or "105"), so it also extracts protein sequences that I don't want, such as Lactococcus_1051 and Lactococcus_1052, instead of only Lactococcus_105.
Is there a way to modify my command line so that it strictly extracts only the protein IDs listed in my text file?
If you have any suggestions, please let me know!
Thank you in advance for your help.
Can you add -w to your grep?
Yes, that's exactly what was missing on my command line! Thank you so much for your quick answer, genomax!
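For anyone landing here later, here is a small sketch of what -w changes. The file contents below are toy data made up to mirror the example in the question, and the temporary directory is only there to keep the demo self-contained. With -w, grep only reports a match that is bounded by non-word characters (anything other than letters, digits, and underscore), so Lactococcus_105 no longer matches inside Lactococcus_1050 or Lactococcus_1051.

```shell
#!/usr/bin/env bash
# Toy FASTA file (made-up data mirroring the example in the question).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>Lactococcus_104'  'MXXXXX' \
  '>Lactococcus_105'  'MXXXXX' \
  '>Lactococcus_1050' 'MXXXXX' \
  '>Lactococcus_1051' 'MXXXXX' > "$tmpdir/file.fasta"

# Without -w: plain substring match, so the longer IDs are hit as well.
without_w=$(grep -n "Lactococcus_105" "$tmpdir/file.fasta")

# With -w: the match must be delimited by non-word characters,
# so only the header for Lactococcus_105 itself is reported.
with_w=$(grep -nw "Lactococcus_105" "$tmpdir/file.fasta")

echo "without -w:"; echo "$without_w"
echo "with -w:";    echo "$with_w"

rm -rf "$tmpdir"
```

In the original loop this amounts to changing grep -nr "$i" to grep -nrw "$i". One caveat: -w would still match an ID that is followed by a non-word character such as "." or "-" (for example a hypothetical Lactococcus_105.1). If the header lines contain nothing but ">" and the ID, which is an assumption about your file, grep -nx ">$i" matches the whole line and is stricter.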