Here is the answer:
Let's assume your original fasta and csv files were like this:
FASTA:
>1A23:A|PDBID|CHAIN|SEQUENCE
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPHCYQFEEVLHISDNVKKKLPEGVKMTKYH
VNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGE
EYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYAD
TVKYLSEKK
>1AAR:A|PDBID|CHAIN|SEQUENCE
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
IQKESTLHLVLRLRGG
>5CRO:A|PDBID|CHAIN|SEQUENCE
MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPS
NKKTTA
CSV was space delimited (field separator can be changed with like BEGIN{FS=";"} it does not really matter):
proteinId wildtype position mutant ss_asa pH temperature ddG class
1A23 H 32 S Helix_57.4 7 30 5.2 IS
1A23 H 32 Y Helix_57.4 7 30 6.8 IS
1AAR K 27 Q Helix_16.1 5 25 -1.91 DS
5CRO Y 26 W Coil_162.9 7 45 -0.1 SC
The first solution is with AWK one-liner (for me it is much faster then in R because awk was designed for strings processing). I can make it in R if you have to have it in R.
awk '(NR==FNR&&$0~">"){name=$1;sub(">","",name);sub(/[:].*/,"",name);id[name]=name} \
(NR==FNR&&$0!~">"){seq[name]=seq[name]""$0} \
(NR!=FNR&&$1!="proteinId"){print ">"$1".p"$2$3$4;tmp=seq[$1]; \
if(substr(tmp,$3,1)==$2){print substr(tmp,1,$3-1)""$4""substr(tmp,$3+1)} \
else{print "Warning! Wrong wild type aminoacid (skipped line): "$0}}' \
input.fasta input.csv > output.fasta
I added ">" in front of your mutated sequences so it is according to FASTA standard format. Technically you may want to split your sequence lines by any length up to 120 to be fully compliant. Also, I separated PDB ID from mutation information with ".p" so it is more readable and closer to popular standards like HGVS notation.
AWK output.fasta:
>1A23.pH32S
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPSCYQFEEVLHISDNVKKKLPEGVKMTKYHVNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGEEYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYADTVKYLSEKK
>1A23.pH32Y
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPYCYQFEEVLHISDNVKKKLPEGVKMTKYHVNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGEEYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYADTVKYLSEKK
>1AAR.pK27Q
MQIFVKTLTGKTITLEVEPSDTIENVQAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
>5CRO.pY26W
MEQRITLKDYAMRFGQTKTAKDLGVWQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA
AWK reads files line by line. When NR=FNR it is in the first file. It populates arrays in memory for your sequences from fasta file. Then it goes to the second file and for each lines prints out mutated one sequence. This is good for single amino acid substitutions. For insertions and deletions, you might want change code a little by introducing length($2) into your code for the length of the change.
If you have to have it in R please let us know.
I would add a snippet of the contents of the files and what the expected output is. At first glance, besides Petr's answer/comment, you are assigning the output of
function()
to a variablesite.mutation
in every iteration of the loop. Basically this means you are creating a function calledsite.mutation
but doing nothing with it. At the end you print the variablei
, which is the sequence you passed, unchanged. So the output you described seems reasonable given the current implementation.Hello and thank you ddiez and Petr for your comments. Here is a snippet of the files as well as the expected output:
s1676.df # a data frame containing a list of PDB IDs, Wildtype Positions, , Mutation Positions, Mutated Amino Acids and other Data
The expected output would be that for each wildtype sequence in the FASTA file "seqs" that based on the mutation information in the "s1676.df" data frame the different mutant sequences could be generated. So for 1A23:
This would be done for every entry across the and for their corresponding mutations.