Question

How To Randomly Select 2 Sequences From Fasta Format In Perl?

0

Entering edit mode

13.2 years ago

Smandape ▴ 120

Hello Experts,

I am handling sequences in FASTA format. I used bioperl to parse the sequences. Now, I have to select any two sequences randomly at a time and use it to find the unknown motif. My initial aim is to choose any sequences randomly. I read that function rand() is used to do randomization. But my concern is how can I do it using function rand()? As rand() uses array for randomization and I cannot put the sequences in an array as I think it will not be efficient. Also, does rand() chooses the sequences only once or it may choose the sequences more than once randomly? This code is suppose to run on any number of sequences with a fasta file as input.

My aim is to follow the greedy algorithm approach from the book An Introduction to Bioinformatics Algorithm by Jones and Pevzner. Here they have used only first two sequences to find (s1,s2) and then rest of the sequences compared with these two vectors to find the score. In my homework problem I have to choose the pair of sequences randomly. The code is as follows:

GREEDYMOTIFSEARCH(DNA, t, n, l)
1 bestMotif ← (1, 1, . . . , 1)
2 s ← (1, 1, . . . , 1)
3 for s1 ← 1 to n − l + 1
4  for s2 ← 1 to n − l + 1
5   if Score(s, 2,DNA) > Score(bestMotif , 2,DNA)
6     BestMotif1 ← s1
7     BestMotif2 ← s2
8 s1 ← BestMotif1
9 s2 ← BestMotif2
10 for i ← 3 to t
11   for si ← 1 to n − l + 1
12     if Score(s, i,DNA) > Score(bestMotif, i,DNA)
13        bestMotifi ← si
14   si ← bestMotifi
15 return bestMotif

where l is the length of the motif I have to find(this is taken from user), n is the number of nucleotide in a single sequence and t are the number of sequences. Thank you for looking at this question. Any help is greatly appreciated. Let me know if my question is not put in a clear way and I will try to rewrite it in other words.

homework algorithm random sequence fasta • 4.8k views

ADD COMMENT • link updated 13.2 years ago by Chris Penkett ▴ 490 • written 13.2 years ago by Smandape ▴ 120

score 3 · Answer 1 · 2011-10-26

3

Entering edit mode

13.2 years ago

Pierre Lindenbaum 164k

not perl, but using a simple command line:

awk '/^>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' < seq.fasta  |\ #linearizze
egrep -v '^$' |\ #no blank line
shuf |\ #shuffle
head -n 2 |\ #two line
tr "\t" "\n" #restore fasta

ADD COMMENT • link 13.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

@Pierre: Thank you for your answer. But I have to use this in my PERL program as I have to do these randomization at least 20 times. Is it not possible in PERL?

ADD REPLY • link 13.2 years ago by Smandape ▴ 120

score 3 · Answer 2 · 2011-10-26

The typical way you would go about it is like this:

You parse the FASTA file into some data structure. This could be a hash table in which you have the name of the sequence as key and the sequence as value. Or if you do not care about a name, it could just be an array in which each element is a sequence.
You find the number of elements (N) in you data structure (i.e. the number of keys in your hash or the length of your array).
You use rand(N) to generate a random floating point in the interval [0;N) and typecast it to an integer. Call that number i.
You fetch sequence number i from your data structure.
You repeat steps 3 and 4 to get a second sequence (making sure that you do not pick the same one twice).

Nothing Perl specific about this solution; this is how you would generally go about it. Unless you resort to nifty command-line hacks like what Pierre suggested, of course. ;-)

Ram · Answer 3 · 2011-10-26

A couple of hints for your homework ;-)

Put the sequence names into an array and then use BioPerl indexing to retrieve the actual sequence from a sequence name:

http://www.bioperl.org/wiki/Module:Bio::Index::Fasta

This perl one-liner uses the rand() function to get a random sequence name from a fasta file. If you want two, you'll need to be careful you don't get the same sequence twice.

perl -lne 'push @seq, $_ if /^>/; END {$len = @seq; $rseq = int(rand($len)); $sname = $seq[$rseq]; $sname =~ s/^>//; print $sname}' seq.fasta

Chris