Question

Perl Script Dies When Processing Large Datafiles. Is It A Perl Buffering Issue?

0

12.1 years ago

dannyjmh ▴ 20

Hey everyone! I think I'm having a buffering issue. I need to read and parse big text files (created by my own script in earlier lines of the code) and finally print results to another file. At some point, after reading a file with 90855 lines, the script fails to read a line of the next file completely. I counted the number of characters read before this happens (233467) and therefore tried to flush the buffer and sleep before reading the next line of the file. It doesn't work. Any suggestions, please? Thanks a lot. The relevant part of the code follows:



for my $o (0..1){
  if ($o==0){
    @files = reverse <*_SITES_3utr>;
  }else{
    @files = reverse <*_SITES_cds>;
  }
  undef(%pita_sites_nu); undef(%pita_tot_score); my($comp_p); undef(%allowed_wobbles); #undef(%site_nu);
  foreach $i(@files){
    my $buff=0;
    print "Analyzing $i\n"; sleep(1);
    $program= $1 if $i=~ /(\w+)_SITES/;
    open(FIL, $i) or die "$!: $i\n";
    while(<FIL>){

      $buff += length($_);
      if ($buff >= 230000){$buff=0; sleep(1); select((select(FIL), $|=1)[0]);} #FLUSH THE BUFFER, NOT WORKING!!!

      undef($a);
      unless($.== 1){
        if ($o==0){
          if (/^\d+\t(\S+)\t(\S+)\t(\d+)\t(\d+)\t(\S+)\t(\S+)\t(.*)/){
            $mirna= $1; $target= $2; $start= $3; $end= $4; $site= $5; $comp_p= $6; $a= $7; $j= "${mirna}_${target}_${start}_$end";
            $site_nu{$j}= "$mirna\t$target\t$start\t$end\t$site\t$comp_p"; #Store each site in a hash
          }else{die "$buff characters, in line $.:$_\n"} #DIES HERE!!!
        }else{
          if (/^\d+\t(\S+)\t(\S+)\t(\d+)\t(\d+)\t(\S+)\t(.*)/){
            $mirna= $1; $target= $2; $start= $3; $end= $4; $site= $5; $a= $6; $j= "${mirna}_${target}_${start}_$end";
            $site_nu{$j}= "$mirna\t$target\t$start\t$end\t$site"; #Store each site in a hash
          }
        }

It dies at the "DIES HERE!!!" die, after reading 3413 characters of the second file. This happens because the regex fails to match, since only half of the line is in $_. Help please! Thanks again.

perl • 4.7k views

ADD COMMENT • link updated 10.0 years ago by Biostar 20 • written 12.1 years ago by dannyjmh ▴ 20

1

Stupid question, but when you look at the line in the second file where your program is dying, does it have the right number of fields? Are they properly delimited? In my experience, Perl is able to handle files that have millions of lines without any special attention to buffering on my part, so I would be very skeptical that your issue lies there.

ADD REPLY • link 12.1 years ago by Mitch Bekritsky ★ 1.3k

0

Hey Mitch, thanks. Not a stupid question at all; it's well received. Yes, all the data is in the file. In the end, I had to flush the filehandle I was using to write to the files before I started parsing them. Maybe I got a buffering problem because I had to open and write to many files earlier in the script. I'm new to Perl, so... Thanks so much.
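For anyone landing here with the same symptom, here is a minimal sketch of the failure mode described above. It is not the original script: the file is a temporary one, the sample row is invented, and `complete_lines` is a hypothetical helper. Data printed to a filehandle sits in Perl's output buffer until the handle is flushed or closed, so a reader that opens the same file too early may see fewer lines, or a line cut in half.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Handle;

# Hypothetical helper: count newline-terminated lines a reader can see right now.
sub complete_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "$!: $path\n";
    my $n = 0;
    while (my $line = <$in>) {
        $n++ if $line =~ /\n\z/;
    }
    close($in);
    return $n;
}

my ($out, $path) = tempfile(UNLINK => 1);

# Write three small (invented) rows; they usually stay in the output buffer.
print {$out} "1\tmiR-1\tgeneA\t10\t17\t7mer\n" for 1 .. 3;

# Reading now may miss data that has not reached the file yet.
my $before = complete_lines($path);

# Push buffered data to the file; close($out) would do the same when writing is finished.
$out->flush;
my $after = complete_lines($path);

close($out) or die "close failed: $!";
print "complete lines before flush: $before, after flush: $after\n";
```

In a real script the simplest habit is to close each output handle (and check the return value of `close`) as soon as writing to it is finished, before opening the same file for reading.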

ADD REPLY • link 12.1 years ago by dannyjmh ▴ 20

1

My pleasure Danny. I've worked with Perl on and off quite a bit, so I'm happy to help when I can. The only other thing to think about is maybe closing all the files you had open earlier in the script? As for stupid questions, usually when I encounter a programming bug that looks like a fault with the language (e.g. no more buffer), the answer is more likely to be a mistake that I made than exposing the shortcomings of a programming language. In my experience, when it's the people who wrote a programming language and/or cosmic rays magically changing output from a program vs my own mistakes, my own mistakes are always the cause ;)

ADD REPLY • link 12.1 years ago by Mitch Bekritsky ★ 1.3k

score 3 · Answer 1 · 2013-03-31

3

12.1 years ago

Istvan Albert 102k

Your $buff variable is an integer, and as such it will not cause any type of memory overflow, so that is not the problem in the least. There is no need to flush it. I am not even sure what the piece of code that you call flushing does, but I am almost certain it is not needed.
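As a side note on that idiom: `select((select(FIL), $|=1)[0])` turns on autoflush for FIL, and autoflush only concerns output, so on a handle opened for reading it does not change how lines are read. A small sketch spelling it out step by step, using a temporary file as a stand-in (the file and the printed string are illustrative assumptions, not part of the original script):

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);

# The one-liner from the question, unrolled:
my $previous = select($fh);   # make $fh the default output handle, remember the old one
$| = 1;                       # enable autoflush on the currently selected handle ($fh)
select($previous);            # restore the old default output handle

# Equivalent with IO::Handle:  $fh->autoflush(1);

my $text = "partial line without newline";
print {$fh} $text;

# With autoflush on, each print is written out immediately,
# so the size on disk already matches what was printed.
my $size = -s $path;
print "bytes on disk: $size\n";

close($fh) or die "close failed: $!";
```

This is why the idiom helps when applied to the handle used for writing the files, and does nothing useful when applied to the handle used for reading them.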

You should not exit a program with a "die" error just because a line does not match a regexp.

The correct solution is to split the line by tabs and then investigate the number of columns and their contents.
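A rough sketch of that approach, under some assumptions: `parse_site_line` is a hypothetical helper, the minimum column count (8 for the 3'UTR files) is inferred from the regex in the question, and the sample rows are invented. Malformed lines are reported with a warning and skipped instead of killing the program.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one tab-delimited site line.
# Returns a hash ref on success; warns and returns undef if the line is malformed.
sub parse_site_line {
    my ($line, $min_cols, $lineno) = @_;
    chomp $line;
    my @f = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    if (@f < $min_cols) {
        warn "line $lineno: expected at least $min_cols columns, got "
           . scalar(@f) . ": '$line'\n";
        return;
    }
    my (undef, $mirna, $target, $start, $end, $site) = @f;
    unless ($start =~ /^\d+$/ && $end =~ /^\d+$/) {
        warn "line $lineno: start/end are not integers: '$line'\n";
        return;
    }
    return {
        mirna  => $mirna,
        target => $target,
        start  => $start,
        end    => $end,
        site   => $site,
        key    => "${mirna}_${target}_${start}_$end",
    };
}

# Invented sample rows: one complete, one truncated halfway through.
my $good = parse_site_line("1\tmiR-1\tgeneA\t10\t17\t7mer\t0.9\textra\n", 8, 2);
my $bad  = parse_site_line("2\tmiR-1\tgeneA\t10", 8, 3);

print "parsed key: $good->{key}\n" if $good;
print "second line was rejected\n" unless $bad;
```

Printing the offending line and its column count makes it easy to tell a genuinely malformed row apart from a line that was cut short because the file was not fully written.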

ADD COMMENT • link 12.1 years ago by Istvan Albert 102k

0

Thank you Istvan. I also tried splitting the line and the same thing happened. Yes, flushing the input buffer is silly. In the end, I flushed the output filehandle I was using before I started parsing the files, and that solved the problem. Thank you so much for the help.

ADD REPLY • link 12.1 years ago by dannyjmh ▴ 20