Question

Simulation Protein Sequences

12.5 years ago

User 1933 ▴ 360

Is there any software or script to generate random/artificial amino acid sequences based on given sequences?

Also, is there any method to create a random sequence from a set of aligned sequences? For example, one might be interested in generating a random sequence within an enzyme family.

amino-acids sequence • 4.6k views

score 1 · Answer 1 · 2012-06-05

12.5 years ago

JC 13k

SMS (Sequence Manipulation Suite) at Bioinformatics.org has "Shuffle Protein": http://www.bioinformatics.org/sms2/shuffle_protein.html

EMBOSS "shuffleseq" can also do the job: http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/shuffleseq.html

Also, writing a simple script in Perl/Ruby/Python is easy.
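As a rough illustration of how small such a script can be, here is a minimal Python sketch that shuffles the residues of one sequence. The function name `shuffle_protein` and the example sequence are made up for this illustration; the shuffle keeps the amino acid composition of the input and only changes the order.

```python
import random


def shuffle_protein(seq, rng=None):
    """Return a random permutation of seq (hypothetical helper).

    The amino acid composition is preserved; only the order changes.
    """
    rng = rng or random.Random()
    residues = list(seq)
    rng.shuffle(residues)  # in-place Fisher-Yates shuffle
    return "".join(residues)


# Example: ten shuffled versions of an arbitrary, made-up peptide
original = "MKVLAAGIVGLLLA"
for _ in range(10):
    print(shuffle_protein(original))
```

Passing a seeded `random.Random(42)` as `rng` makes the output reproducible, which is usually what you want for simulated test data.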

Edit: my original response was for a random (shuffled) protein sequence; the question has since changed to a more elaborate problem.

How about enzyme families? Is there any method to create a random sequence from a set of aligned sequences? I will update the question now. Sorry about that.

Reply • 12.5 years ago by User 1933 ▴ 360

Oh, that's a completely different problem. What is your goal? To provide a control for multiple alignment algorithms?

Reply • 12.5 years ago by JC 13k

Not really. I want to generate simulated data as a proof of concept for a method. The method is a predictive model for predicting the function of protein sequences. Should I open a new question?

Reply • 12.5 years ago by User 1933 ▴ 360

score 1 · Answer 2 · 2012-06-05

12.5 years ago

Woa ★ 2.9k

http://www.biostars.org/post/show/8656/how-to-scramble-a-sequence-using-an-existing-script-or-a-python-method/

score 1 · Answer 3 · 2012-06-05

For short(ish) amino acid sequences, you could write a brief R script to do this for you. For example, for the amino acid sequence "VARY":

library(Deducer) # this includes a whole bunch of other libraries

#get input
input <- "VARY"

#split to individual characters and get the first vector of the resulting list
sp.input <- strsplit( input, split='')[[1]]

#use deducer package to make all permutations
perms <- perm(sp.input)

#print results
print(perms)

The permutation matrix rapidly becomes very large as the number of input characters increases (n characters give up to n! permutations), so generating all alternatives is impractical for anything but short sequences. Alternatively, you can call sample() repeatedly, as many times as you need.

sample(sp.input,length(sp.input), replace=FALSE)

#repeated calls for the iterative mind
num.samples <- 10
for(i in 1:num.samples)
{
  #get a random sample of equal length to input
  random.sample <- sample(sp.input,size=length(sp.input), replace=FALSE)

  #paste the letters back together, collapse to single string, and print
  random.sample <- paste(random.sample,sep='',collapse='')
  print(random.sample)
}

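Another solution could be to sort your peptide sequence, do a run-length encoding, and divide each run length by the total number of residues. This would give you the probabilities of the individual amino acids. You could use this vector of probabilities in the sample() function. In this way, the vector you sample from never grows beyond a length of 22 (one entry per distinct amino acid), however long the input is. The replace parameter would have to be TRUE, and the size parameter would have to be set independently (for example, to the length of the original sequence). Note that the generated sequences then follow the input composition only on average, rather than matching it exactly as a shuffle does.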
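The composition-based idea above can be sketched as follows. This is written in Python purely for illustration (in R the same logic would go through the prob argument of sample()), and the name `sample_by_composition` and its parameters are hypothetical. `Counter` stands in for the sort plus run-length encoding step, since both simply count how often each residue occurs.

```python
import random
from collections import Counter


def sample_by_composition(seq, size=None, rng=None):
    """Draw a random sequence whose residues follow the composition of seq.

    Hypothetical helper: residues are sampled WITH replacement, so the
    output matches the input frequencies only on average.
    """
    if not seq:
        raise ValueError("seq must not be empty")
    rng = rng or random.Random()
    counts = Counter(seq)  # same information as sort + run-length encoding
    residues = sorted(counts)
    probs = [counts[r] / len(seq) for r in residues]  # residue frequencies
    size = len(seq) if size is None else size  # size is set independently
    return "".join(rng.choices(residues, weights=probs, k=size))


# Example: a 30-residue sequence drawn from the composition of "VARY"
print(sample_by_composition("VARY", size=30))
```

Because the alphabet is derived from the input, residues that never occur in the input can never appear in the output.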

score 1 · Answer 4 · 2012-06-06

12.5 years ago

Botond Sipos ★ 1.7k

To answer your second question: you can use hmmbuild from the HMMER package to build a profile HMM modelling your multiple alignment, and then use hmmemit to generate sequences that have the characteristics of your protein family.
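To give a feel for what emitting sequences from an alignment means, here is a deliberately simplified Python sketch. It is not HMMER and not a profile HMM: it samples each alignment column independently from the residue counts observed in that column, with no modelling of insertions, deletions, or background frequencies. The function names and the toy alignment are invented for this illustration; for real work, the hmmbuild/hmmemit route described above is the appropriate tool.

```python
import random
from collections import Counter


def column_profile(alignment):
    """Per-column residue counts for a list of equal-length aligned sequences."""
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("need a non-empty alignment of equal-length sequences")
    return [Counter(col) for col in zip(*alignment)]


def emit_sequence(profile, rng=None, gap="-"):
    """Sample one residue per column, then drop gap characters.

    Columns are treated as independent, which a profile HMM does not assume
    in the same way; this is a stand-in, not an equivalent.
    """
    rng = rng or random.Random()
    out = []
    for counts in profile:
        symbols = sorted(counts)
        weights = [counts[s] for s in symbols]  # observed counts as weights
        out.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(out).replace(gap, "")


# Toy alignment (made up); conserved columns are reproduced exactly
alignment = ["MKV-LA", "MRV-LA", "MKIALA"]
profile = column_profile(alignment)
for _ in range(5):
    print(emit_sequence(profile))
```

Fully conserved columns always come out unchanged, while variable columns vary in proportion to what was observed, which is the basic property one wants from family-like simulated sequences.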
