Question

How To Find Any Of Many Motifs In Certain Sequences Using Perl?

8

Entering edit mode

15.2 years ago

nikulina ▴ 300

Dear colleagues, I am applying for your help once more. Here is my perl script for counting the length between the binding sites and the start points of exons. So, there are some genes saved in the file called sequence.txt and some amount of binding sites in motif.txt. The thing I want to do is to count the length of the fragment for every gene if it has any of the binding sites from motif.txt. This current script does not work in the way I would like it. How should it be changed? Shall I add another while loop for $motif?

$string_filename = 'sequence.txt';  

open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");

$motif_filename = 'motif.txt';   

open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filename\n"); 

local $/ = "\n>";  

   while (my $seq = <FILE>) {
chomp $seq;
$seq =~ s/^>*.+\n//;  
$seq =~ s/\n//g;  

$R = length $seq;  
  $motif = <MOTIF>; 
chomp $motif;
$motif =~ s/^>*.+\n//; 
$motif =~ s/\n//g; 
  if ( $seq =~ /$motif/ ) {     ## insert actual binding site 
    $M = $';
    $W = length $M;

    if ( $seq =~ /[A-Z]/) {    ##  exon start
        $K = $`;   
        $Z = length $K;  
        $x = $W + $Z - $R;     
        print "\n\ the distance is the following: $x\n\n";

    } else {  
        print "\n\ I couldn't find the start codon.\n\n";
    }
} else {  
    print "\n\ I couldn't find the binding site.\n\n";   
}

}

close MOTIF;   
close FILE;   

exit;

perl motif sequence • 8.9k views

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by nikulina ▴ 300

3

Entering edit mode

please, show us what your inputs look like.

ADD REPLY • link 15.2 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

There is a FASTQ-FASTA converter here - and Bioperl's Bio::SeqIO::fastq will handle some FASTQ formats.

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by Neilfws 49k

0

Entering edit mode

the binding sites look like this

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by nikulina ▴ 300

0

Entering edit mode

and the genes

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 15.2 years ago by nikulina ▴ 300

0

Entering edit mode

binding site=A FASTQ file ??!!!!!

ADD REPLY • link 15.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I recognise that it is a problem but i haven't found the way to convert it to fasts yet.

ADD REPLY • link 15.2 years ago by nikulina ▴ 300

0

Entering edit mode

And further discussion of FASTQ-FASTA conversion at StackOverflow.

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by Neilfws 49k

Ram · Answer 1 · 2010-06-17

I think you need to take a step back and think about the logic of your problem, before writing the code. Break your problem down into steps, like this:

Read motifs into a data structure over which you can loop (e.g. an array)
Read in your sequence(s)
For each sequence, identify segments between binding site and exon start
For each segment...
For each motif, check for match to segment
If at least one match, store/print segment length (and perhaps other useful information)

For tools to convert FASTQ-FASTA, see comments under your question.

As we said in your previous query, Bioperl makes this kind of task very trivial. There is a learning curve, but I would urge you to investigate - it will pay off in the long term. Using more "standard" code also makes it a lot easier for other people to read and debug.

score 3 · Answer 2 · 2010-06-16

3

Entering edit mode

15.2 years ago

Jeremy Leipzig 23k

you should load your motifs into an array and then iterate over that motif array for each sequence

ADD COMMENT • link 15.2 years ago by Jeremy Leipzig 23k

Ram · Answer 3 · 2010-06-18

I eventually recognised how to modify my script to do the job I wished. Now it looks like the following:)

$string_filename = 'seq.txt';  
open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");   
$motif_filename = 'motif.txt';   
open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filename\n");   
local $/ = "\n>";

while (@motif = <MOTIF>) {              
  while (my $seq = <FILE>) {
    chomp $seq;
    $seq =~ s/^>*.+\n//;  
    $seq =~ s/\n//g;  
    $R = length $seq;    
    foreach $site (@motif) {
      chomp $site;
      $site =~ s/^>*.+\n//; 
      $site =~ s/\n//g;  
      if ( $seq =~ /$site/ ) {     
        $M = $';
        $W = length $M;
        if ( $seq =~ /[AGTC]/) {   
          $K = $`;   
          $Z = length $K;  
          $x = $W + $Z - $R;     
          print "\nthe distance is the following: $x\n\n";
                               } 
                             } 
                            } 
                           }  
                          } 
close FILE;    
exit;

Today I tried to do the same using BioPerl module and recognised it to be much more easier indeed. Thanks a million for your help and support:)