Question

Amino Acid Substitution Loop Generates Incorrect Output: No Mutations in the Output, but the same sequence the multiple times

0

Entering edit mode

8.3 years ago

jguevar • 0

Hello everyone, I am constructing a loop to generate protein variants, for future machine learning use, from the wildtypes in a FASTA file and the mutation data in a csv file. When I print the iterator i, the result is the same sequence, no amino acid changes in the sequence, and it is printed multiple times, apparently the same number of times as the number of mutation positions in that column. I have tried troubleshooting it, but cannot seem to find what is going on. Any help is appreciated. I wrote a function and a nested loop as follows:

# load package to read fats file
library(seqinr)

#FASTA file with a test sequence to try the loop
seqs <- read.fasta(file = "test.fasta", as.string = TRUE, seqtype = "AA")
s <- seqs

#CSV file with the mutation data to use to generate the variants
s2648.df <- read.csv("S2648.csv")
mut.position <- s2648.df$POSITION
mut.base <- s2648.df$MUTANT
sequence.id <- s2648.df$PDB
wt.position <- s2648.df$WILD_TYPE

site.mutation <-
  functionsequence.id, position, mut.base){
    s[[sequence.id]][position] <<- mut.base
  }

for(i in s) {
       for(j in sequence.id) {
         site.mutation <-
           functionsequence.id, position, mut.base){
             s[[sequence.id]][position] <<- mut.base 
       }
       print(i)}
  }

R • 2.3k views

ADD COMMENT • link updated 8.3 years ago by GenoMax 151k • written 8.3 years ago by jguevar • 0

1

Entering edit mode

I would add a snippet of the contents of the files and what the expected output is. At first glance, besides Petr's answer/comment, you are assigning the output of function() to a variable site.mutation in every iteration of the loop. Basically this means you are creating a function called site.mutation but doing nothing with it. At the end you print the variable i, which is the sequence you passed, unchanged. So the output you described seems reasonable given the current implementation.

ADD REPLY • link 8.3 years ago by ddiez ★ 2.0k

0

Entering edit mode

Hello and thank you ddiez and Petr for your comments. Here is a snippet of the files as well as the expected output:

seqs # file containing 68 FASTA formatted amino acid sequences
$`1A23:A|PDBID|CHAIN|SEQUENCE`
[1] "AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPHCYQFEEVLHISDNVKKKLPEGVKMTKYHVNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGEEYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYADTVKYLSEKK"
attr(,"name")
[1] "1A23:A|PDBID|CHAIN|SEQUENCE"
attr(,"Annot")
[1] ">1A23:A|PDBID|CHAIN|SEQUENCE"
attr(,"class")
[1] "SeqFastaAA"

$`1AAR:A|PDBID|CHAIN|SEQUENCE`
[1] "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
attr(,"name")
[1] "1AAR:A|PDBID|CHAIN|SEQUENCE"
attr(,"Annot")
[1] ">1AAR:A|PDBID|CHAIN|SEQUENCE"
attr(,"class")
[1] "SeqFastaAA"
...
$`5CRO:A|PDBID|CHAIN|SEQUENCE`
[1] "MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA"
attr(,"name")
[1] "5CRO:A|PDBID|CHAIN|SEQUENCE"
attr(,"Annot")
[1] ">5CRO:A|PDBID|CHAIN|SEQUENCE"
attr(,"class")
[1] "SeqFastaAA"

s1676.df # a data frame containing a list of PDB IDs, Wildtype Positions, , Mutation Positions, Mutated Amino Acids and other Data

proteinId   wildtype    position    mutant  ss_asa        pH    temperature ddG class
1A23         H    32              S         Helix_57.4  7   30                   5.2      IS
1A23            H     32               Y            Helix_57.4  7   30                   6.8   IS
1AAR            K     27               Q            Helix_16.1  5   25                  -1.91  DS
...
5CRO             Y    26                W           Coil_162.9  7   45                   -0.1   SC

The expected output would be that for each wildtype sequence in the FASTA file "seqs" that based on the mutation information in the "s1676.df" data frame the different mutant sequences could be generated. So for 1A23:

>1A23
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPSCYQFEEVLHISDNVKKKLPEGVKMTKYHVNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGEEYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYADTVKYLSEKK

>1A23H32S
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPSCYQFEEVLHISDNVKKKLPEGVKMTKYHVNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGEEYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYADTVKYLSEKK

>1A34H32Y
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPYCYQFEEVLHISDNVKKKLPEGVKMTKYHVNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGEEYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYADTVKYLSEKK

This would be done for every entry across the and for their corresponding mutations.

ADD REPLY • link updated 8.3 years ago by GenoMax 151k • written 8.3 years ago by jguevar • 0

1

Entering edit mode

8.3 years ago

Petr Ponomarenko ★ 2.8k

Never used this library. Why you do not have "(" in

functionsequence.id, position, mut.base)

It appears twice in your code

ADD COMMENT • link 8.3 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

And you do not change i in your loop. Why should it print(i) different things?

ADD REPLY • link 8.3 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Hello Petr, Thank you for your reply. Sorry about the delay in answering. You are right in your comments. I have done some work and figured out certain issue with my function. Foremost the FASTA sequences needed to be handled as strings so in also loaded the package "stringer". I have added the new code: I am still unable to generate the mutants from the sequences although the function when tested is capable of generated a mutant on its own.

library(seqinr) # this package allows for the reading and writing of FASTA files in R. Substituting Biostrings which was causing #me errors
library(stringr)

seqs <- read.fasta(file = "blast_clust_s1676.fasta", as.string = TRUE, seqtype = "AA")
seqs

s1676.df <- read.csv("s1676_features_and_labels.csv")
mut.position <- s1676.df$position
mut.base <- s1676.df$mutant
id <- s1676.df$proteinId
wt.position <- s1676.df$wildtype

site.mutatation <-
  function(id, position, mut.base){
    prefix = str_sub(seqs[[id]][1], start = 1, end = position-1)
    suffix = str_sub(seqs[[id]][1], start = position+1, end = nchar(seqs[[id]][1]))
    new_seq = str_c(prefix, mut.base, suffix)
    return(new_seq)
    }

for(i in 1:length(seqs)) {
       for(j in mut.position) {
         site.mutatation(id = i, position = j, mut.base = mut.base) 
         print(j)
    }
  }

ADD REPLY • link updated 8.3 years ago by GenoMax 151k • written 8.3 years ago by jguevar • 0

0

Entering edit mode

We as a community will be able to help you faster if instead of showing us code and asking why it is not working, you just say what are the inputs with examples and what are the outputs with examples.

ADD REPLY • link 8.3 years ago by Petr Ponomarenko ★ 2.8k

score 3 · Accepted Answer · 2017-02-22

Here is the answer:

Let's assume your original fasta and csv files were like this:

FASTA:

>1A23:A|PDBID|CHAIN|SEQUENCE
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPHCYQFEEVLHISDNVKKKLPEGVKMTKYH
VNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGE
EYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYAD
TVKYLSEKK
>1AAR:A|PDBID|CHAIN|SEQUENCE
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
IQKESTLHLVLRLRGG
>5CRO:A|PDBID|CHAIN|SEQUENCE
MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPS
NKKTTA

CSV was space delimited (field separator can be changed with like BEGIN{FS=";"} it does not really matter):

proteinId wildtype position mutant ss_asa pH temperature ddG class
1A23 H 32 S Helix_57.4 7 30 5.2 IS
1A23 H 32 Y Helix_57.4 7 30 6.8 IS
1AAR K 27 Q Helix_16.1 5 25 -1.91 DS
5CRO Y 26 W Coil_162.9 7 45 -0.1 SC

The first solution is with AWK one-liner (for me it is much faster then in R because awk was designed for strings processing). I can make it in R if you have to have it in R.

awk '(NR==FNR&&$0~">"){name=$1;sub(">","",name);sub(/[:].*/,"",name);id[name]=name} \
  (NR==FNR&&$0!~">"){seq[name]=seq[name]""$0} \
  (NR!=FNR&&$1!="proteinId"){print ">"$1".p"$2$3$4;tmp=seq[$1]; \
  if(substr(tmp,$3,1)==$2){print substr(tmp,1,$3-1)""$4""substr(tmp,$3+1)} \
  else{print "Warning! Wrong wild type aminoacid (skipped line): "$0}}' \
  input.fasta input.csv > output.fasta

I added ">" in front of your mutated sequences so it is according to FASTA standard format. Technically you may want to split your sequence lines by any length up to 120 to be fully compliant. Also, I separated PDB ID from mutation information with ".p" so it is more readable and closer to popular standards like HGVS notation.

AWK output.fasta:

>1A23.pH32S
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPSCYQFEEVLHISDNVKKKLPEGVKMTKYHVNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGEEYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYADTVKYLSEKK
>1A23.pH32Y
AQYEDGKQYTTLEKPVAGAPQVLEFFSFFCPYCYQFEEVLHISDNVKKKLPEGVKMTKYHVNFMGGDLGKDLTQAWAVAMALGVEDKVTVPLFEGVQKTQTIRSASDIRDVFINAGIKGEEYDAAWNSFVVKSLVAQQEKAAADVQLRGVPAMFVNGKYQLNPQGMDTSNMDVFVQQYADTVKYLSEKK
>1AAR.pK27Q
MQIFVKTLTGKTITLEVEPSDTIENVQAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
>5CRO.pY26W
MEQRITLKDYAMRFGQTKTAKDLGVWQSAINKAIHAGRKIFLTINADGSVYAEEVKPFPSNKKTTA

AWK reads files line by line. When NR=FNR it is in the first file. It populates arrays in memory for your sequences from fasta file. Then it goes to the second file and for each lines prints out mutated one sequence. This is good for single amino acid substitutions. For insertions and deletions, you might want change code a little by introducing length($2) into your code for the length of the change.

If you have to have it in R please let us know.