Perl Script Dies When Processing Large Datafiles. Is It A Perl Buffering Issue?
11.7 years ago
dannyjmh ▴ 20

Hey everyone! I think I'm having a buffering issue. I need to read and parse big text files (created by my own script in earlier lines of the code) and then print the results to another file. At some point, after reading a file with 90855 lines, the script does not read a line of the next file completely. I counted the number of characters read before this happens (233467), so I tried to flush the buffer and sleep before reading the next line of the file. It doesn't work. Any suggestions, please? Thanks a lot. The relevant part of the code follows:

for my $o (0..1){
  if ($o==0){
    @files = reverse <*_SITES_3utr>;
  }else{
    @files = reverse <*_SITES_cds>;
  }
undef(%pita_sites_nu);undef(%pita_tot_score);my($comp_p);undef(%allowed_wobbles);#undef(%site_nu);
foreach $i(@files){
   my $buff=0;
  print "Analyzing $i\n";sleep(1);
  $program= $1 if $i=~ /(\w+)_SITES/;
  open(FIL, $i) or die "$!: $i\n";
  while(<FIL>){

    $buff += length($_); if ($buff >= 230000){$buff=0;sleep(1);select((select(FIL), $|=1)[0]);} #FLUSH THE BUFFER, NOT WORKING!!!

    undef($a);
    unless($.== 1){
      if ($o==0){
        if (/^\d+\t(\S+)\t(\S+)\t(\d+)\t(\d+)\t(\S+)\t(\S+)\t(.*)/){
          $mirna= $1; $target= $2; $start= $3; $end= $4; $site= $5; $comp_p= $6;$a= $7;$j= "${mirna}_${target}_${start}_$end";
          $site_nu{$j}= "$mirna\t$target\t$start\t$end\t$site\t$comp_p";#Store each site in a hash
        }else{die "$buff characters, in line $.:$_\n"} #DIES HERE!!!
      }else{
        if (/^\d+\t(\S+)\t(\S+)\t(\d+)\t(\d+)\t(\S+)\t(.*)/){
          $mirna= $1; $target= $2; $start= $3; $end= $4; $site= $5;$a= $6;$j= "${mirna}_${target}_${start}_$end";
          $site_nu{$j}= "$mirna\t$target\t$start\t$end\t$site";#Store each site in a hash
        }
      }

It dies at the "DIES HERE!!!" die, after reading 3413 characters of the second file. This happens because the regex fails to match, since only half of the line is in $_. Help please! Thanks again.


Stupid question, but when you look at the line in the second file where your program is dying, does it have the right number of fields? Are they properly delimited? In my experience, Perl is able to handle files that have millions of lines without any special attention to buffering on my part, so I would be very skeptical that your issue lies there.


Hey Mitch. Thanks. Not a stupid question at all, it's well received. Yes, all the data is in the file. In the end, I had to flush the filehandle I was using to write to the files before starting to parse them. Maybe I got a buffering problem because I had to open and write to many files earlier in the script. I'm new to Perl, so... Thanks so much.
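For anyone landing here with the same symptom, this is a minimal sketch of the effect described above. The temporary file and its contents are hypothetical stand-ins for the *_SITES_3utr files, not the original data. It writes lines through a buffered filehandle, counts the complete lines on disk while the handle is still open, and counts them again after close:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for a *_SITES_3utr file written earlier in the script.
my ($out, $path) = tempfile(UNLINK => 1);
my $n_lines = 5000;
print {$out} join("\t", $_, "mirna$_", "target$_", 10, 30, "8mer", "0.5", "rest"), "\n"
    for 1 .. $n_lines;

# Count the complete (newline-terminated) lines currently on disk.
sub complete_lines {
    my ($file) = @_;
    open(my $in, '<', $file) or die "$!: $file\n";
    my $count = grep { /\n\z/ } <$in>;
    close($in);
    return $count;
}

my $before = complete_lines($path);        # some output may still sit in Perl's write buffer
close($out) or die "close failed: $!\n";   # close flushes everything to disk
my $after = complete_lines($path);

print "complete lines before close: $before, after close: $after\n";
```

The count before close will typically be lower than the count after, with the last line on disk cut off in the middle. That matches the "half a line in $_" symptom from the question. Closing the output handle before reopening the file for reading is usually the simplest fix.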


My pleasure Danny. I've worked with Perl on and off quite a bit, so I'm happy to help when I can. The only other thing to think about is maybe closing all the files you had open earlier in the script? As for stupid questions, usually when I encounter a programming bug that looks like a fault with the language (e.g. no more buffer), the answer is more likely to be a mistake that I made than a shortcoming of the programming language. In my experience, when the choice is between the people who wrote the programming language (and/or cosmic rays magically changing a program's output) and my own mistakes, my own mistakes are always the cause ;)

11.7 years ago

Your buff variable is an integer, so it cannot cause any kind of memory overflow; that is not the problem in the least. There is no need to flush it. I am not even sure what the piece of code that you call flushing does, but I am almost certain it is not needed.
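As a note on that idiom: select((select(FIL), $|=1)[0]) temporarily makes FIL the default output handle, sets $| (autoflush) on it, and then restores the previous default handle. Autoflush only affects writes, so it does nothing useful on a handle opened for reading. A minimal sketch of where it does matter, using the equivalent autoflush method from IO::Handle on an output handle (the temporary file here is hypothetical):

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw(tempfile);

my ($out, $path) = tempfile(UNLINK => 1);   # hypothetical output file

# Same effect as select((select($out), $| = 1)[0]), spelled out:
$out->autoflush(1);

print {$out} "mirna\ttarget\t10\t30\n";     # flushed to disk right after the print

# The line is already on disk, although $out is still open.
open(my $in, '<', $path) or die "$!: $path\n";
my $line = <$in>;
close($in);
print "read back: $line";
```

Autoflush costs some write performance, so for files that are written once and read later, closing the handle is often the better choice.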

You should not exit a program with a "die" error just because a line does not match a regexp.

The correct solution is to split the line by tabs and then investigate the number of columns and their contents.
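A minimal sketch of that approach, assuming the eight-column layout implied by the regex in the question (index, miRNA, target, start, end, site, score, rest). The sub name parse_site_line and the sample lines are made up for illustration:

```perl
use strict;
use warnings;

# Expected column count for the 3'UTR files, inferred from the regex in the
# question: index, miRNA, target, start, end, site, score, rest (an assumption).
my $EXPECTED = 8;

sub parse_site_line {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line, -1;     # -1 keeps trailing empty fields
    return unless @f >= $EXPECTED;     # too few columns: truncated or malformed
    return unless $f[3] =~ /^\d+$/ && $f[4] =~ /^\d+$/;   # start and end must be integers
    my ($mirna, $target, $start, $end) = @f[1 .. 4];
    return { key => "${mirna}_${target}_${start}_$end", fields => [ @f[1 .. 6] ] };
}

my @lines = (
    "1\tmiR-1\tGENE1\t10\t30\t8mer\t0.5\textra\n",
    "2\tmiR-2\tGENE2\t40\n",           # truncated line, like the one in the question
);

for my $i (0 .. $#lines) {
    my $rec = parse_site_line($lines[$i]);
    if ($rec) { print "ok: $rec->{key}\n" }
    else      { warn "line ", $i + 1, ": unexpected format, skipping\n" }
}
```

Warning and skipping instead of dying keeps the run going and reports which line was malformed, which makes a truncated line much easier to track down.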


Thank you Istvan. I also tried splitting the line and the same thing happened. Yes, flushing the input buffer is silly. In the end, I flushed the output filehandle I had been using before starting to parse the files, and the problem was solved. Thank you so much for the help.
