Question

Help Me Finish This Perl Code To Extract A Column In A Table

2

12.3 years ago

shane.neeley ▴ 50

Hi, I have a question similar to this one:

http://www.biostars.org/post/show/50142/any-modules-available-to-parse-this-file/#50156

I adapted my code from JC's answer in that post. Thanks, JC.

Here is an example of the data file I am opening and trying to read the columns of. The values are delimited by 4 spaces.

A bunch of junk up here. Paragraph before getting to table.

NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C
1     k      C     0.047     0.240     0.713     
2     l      C     0.067     0.365     0.568     
3     n      C     0.067     0.365     0.568     
4     f      E     0.045     0.613     0.342     
...

Here is the code I have tried, which doesn't print anything. I want to be able to gather the data from PROB_H, PROB_E, PROB_C and have them in separate lists so that I can do stuff like take the averages of them.

use strict;
use warnings;

open(FILE, "file_data.txt") or die "Cannot open file: $!";

my @data = <FILE>;

while (<FILE>) {
    next if m/^No./;
    chomp;
    my ($NO, $RES, $DSC_SEC, $PROB_H, $PROB_E, $PROB_C) = split(/\s+/, @data);
    print "$PROB_H";
}

close(FILE);

perl data extraction • 10k views

ADD COMMENT • link updated 12.3 years ago by Irsan ★ 7.8k • written 12.3 years ago by shane.neeley ▴ 50

0

Why would I be downvoted?

ADD REPLY • link 12.3 years ago by shane.neeley ▴ 50

0

Some people are harsh :) Someone probably thought this was a rather basic Perl programming question, as opposed to a bioinformatics research question.

ADD REPLY • link 12.3 years ago by Neilfws 49k

0

Two obvious errors straight off: (1) you have not escaped the period in your regular expression, so it matches any single character rather than a literal "."; (2) your data contains a line starting with NO (all upper-case), but your regular expression is case-sensitive and looks for lines starting with No (lower-case "o").

ADD REPLY • link 12.3 years ago by Neilfws 49k
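To make both points concrete, here is a minimal sketch. It assumes the header looks like the one in the question; the sub name is_header and the "Note ..." sample line are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if the line is the table header.
# "\." matches a literal period, and the /i flag ignores case,
# so both "NO." and "No." are recognised.
sub is_header {
    my ($line) = @_;
    return $line =~ /^NO\./i ? 1 : 0;
}

my @samples = (
    "NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C",
    "1     k      C     0.047     0.240     0.713",
    "Note that this line is not a header",    # invented example line
);

for my $line (@samples) {
    printf "%-7s %s\n", (is_header($line) ? "header" : "other"), $line;
}

# The pattern from the question, /^No./, is case-sensitive and its "."
# matches any character: it misses "NO." but accepts "Note".
print "/^No./ matches 'NO.':  ", ("NO."  =~ /^No./ ? "yes" : "no"), "\n";
print "/^No./ matches 'Note': ", ("Note" =~ /^No./ ? "yes" : "no"), "\n";
```

Whether to use /i depends on how much the header varies between files; an exact /^NO\./ is stricter.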

0

Basically, you want to implement the 'cut' unix command in Perl? Specifically, something like tail -n +2 | cut -c18-26?

ADD REPLY • link 12.3 years ago by Ketil 4.1k

score 6 · Answer 1 · 2012-09-27

6

12.3 years ago

Zev.Kronenberg 12k

It would probably be better to ask this question at stackoverflow.

Without a file it is kinda difficult to debug, but this may do the trick. You could also just use grep | awk ....

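#!/usr/bin/perl
use strict;
use warnings;

# Three-argument open with a lexical filehandle; '<' opens the file read-only.
open(my $FH, '<', "file_data.txt") or die "Cannot open file: $!";
LINE: while (my $line = <$FH>) {
    chomp $line;
    # Skip every line that does not start with a digit (header, paragraph text).
    next LINE unless $line =~ /^[0-9]/;
    my ($NO, $RES, $DSC_SEC, $PROB_H, $PROB_E, $PROB_C) = split /\s+/, $line;
    print "$PROB_H\n";
}
close($FH);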

ADD COMMENT • link 12.3 years ago by Zev.Kronenberg 12k
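The question also asks for the three probability columns as separate lists so they can be averaged. A rough sketch of that step follows. It runs on the four sample rows from the question instead of reading file_data.txt, and the helper name mean is made up here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: arithmetic mean of a list, or undef if the list is empty.
sub mean {
    my @values = @_;
    return undef unless @values;
    return sum(@values) / @values;
}

# Sample rows copied from the question; a real script would read
# these lines from the file instead.
my @lines = (
    "NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C",
    "1     k      C     0.047     0.240     0.713     ",
    "2     l      C     0.067     0.365     0.568     ",
    "3     n      C     0.067     0.365     0.568     ",
    "4     f      E     0.045     0.613     0.342     ",
);

my (@prob_h, @prob_e, @prob_c);
for my $line (@lines) {
    next unless $line =~ /^[0-9]/;    # same row filter as in the answer
    # The first three columns are not needed here, so they go to undef.
    my (undef, undef, undef, $h, $e, $c) = split /\s+/, $line;
    push @prob_h, $h;
    push @prob_e, $e;
    push @prob_c, $c;
}

printf "PROB_H mean: %.4f\n", mean(@prob_h);
printf "PROB_E mean: %.4f\n", mean(@prob_e);
printf "PROB_C mean: %.4f\n", mean(@prob_c);
```

Keeping each column in its own array means any other summary (min, max, counts) can be added later without re-reading the file.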

0

Let me know if it works.

ADD REPLY • link 12.3 years ago by Zev.Kronenberg 12k

0

That gives me my column, thanks. I'm new to Perl; what do LINE: and $_ do here?

ADD REPLY • link 12.3 years ago by shane.neeley ▴ 50

2

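You can name your loops in Perl. A label such as LINE: lets next and last say which loop they apply to, which is useful for keeping track of things in nested loops. $_ is a special variable: it is Perl's default variable. In your script above, while (<FILE>) assigns each line it reads to $_.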

ADD REPLY • link 12.3 years ago by Zev.Kronenberg 12k
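A small sketch of both features may help. The data and the names ROW, @seen and @digits_first are invented for the illustration. Note that the answer's code reads into $line, so $_ only comes into play in loops written without a loop variable.

```perl
use strict;
use warnings;

my @rows = ("1 a b", "2 c STOP", "3 e f");
my @seen;

# ROW is a label naming the outer loop (the name itself is arbitrary).
ROW: for my $row (@rows) {
    for my $field (split /\s+/, $row) {
        # "next ROW" abandons the rest of this row and continues with the
        # next iteration of the OUTER loop, not just the inner one.
        next ROW if $field eq 'STOP';
        push @seen, $field;
    }
}
print "@seen\n";

# $_ is the default variable. A "for" loop without a loop variable puts
# each element in $_, and chomp and pattern matches act on $_ when they
# are given no explicit argument.
my @raw = ("10 x\n", "header\n", "20 y\n");
my @digits_first;
for (@raw) {
    chomp;                   # same as: chomp $_;
    next unless /^[0-9]/;    # same as: next unless $_ =~ /^[0-9]/;
    push @digits_first, $_;
}
print "$_\n" for @digits_first;
```

With a single loop, as in the answer, the label is optional: a plain next would behave the same way.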

0

Also, what does this mean: unless it matches [0-9]?

ADD REPLY • link 12.3 years ago by shane.neeley ▴ 50

0

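^ anchors the match to the start of the line, and [0-9] matches a single digit, so the test skips every line that does not start with a digit.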

ADD REPLY • link 12.3 years ago by Zev.Kronenberg 12k

0

It also ends up printing one of the words up in the paragraph. Can we make it start on the header of the table, and grab the numbers below the header, like I tried before?

ADD REPLY • link 12.3 years ago by shane.neeley ▴ 50
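One way to do this is to ignore everything until the header has been seen and only then start collecting rows. This is a minimal sketch, not tested on the real file. The sub name prob_h_after_header and the "20 residues ..." paragraph line are made up, and it assumes the header starts with "NO." and that no unrelated numbered lines follow the table.

```perl
use strict;
use warnings;

# Hypothetical helper: return the PROB_H values of rows found AFTER the header.
sub prob_h_after_header {
    my @lines = @_;
    my $in_table = 0;
    my @prob_h;
    for my $line (@lines) {
        if (!$in_table) {
            # Before the header, every line is ignored, even one that
            # happens to start with a number.
            $in_table = 1 if $line =~ /^NO\./;
            next;
        }
        next unless $line =~ /^\d+\s/;    # keep only numbered rows
        my @fields = split /\s+/, $line;
        push @prob_h, $fields[3];         # PROB_H is the fourth column
    }
    return @prob_h;
}

my @lines = (
    "A bunch of junk up here. Paragraph before getting to table.",
    "20 residues were predicted in this run.",    # invented paragraph line
    "NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C",
    "1     k      C     0.047     0.240     0.713     ",
    "2     l      C     0.067     0.365     0.568     ",
);

print "$_\n" for prob_h_after_header(@lines);
```

To run it on the file, pass the lines read from the filehandle instead of the sample array.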

0

And if I have a lot of columns?

ADD REPLY • link 8.0 years ago by cmcouto.silva ▴ 60
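With many columns, naming one scalar per column gets unwieldy. A possible alternative is to split each row into an array and store the values under the names taken from the header line. The sketch below assumes the header's first field is "NO." as in the question; parse_columns is an invented name.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a whitespace-delimited table into a hash of
# column name => array reference of values, using the header for the names.
sub parse_columns {
    my @lines = @_;
    my (@names, %columns);
    for my $line (@lines) {
        my @fields = split ' ', $line;    # ' ' also ignores leading whitespace
        next unless @fields;              # skip blank lines
        if (!@names) {
            next unless $fields[0] eq 'NO.';    # skip text before the header
            @names = @fields;
            next;
        }
        # Keep only numbered rows with one value per header column.
        next unless $fields[0] =~ /^\d+$/ && @fields == @names;
        for my $i (0 .. $#names) {
            push @{ $columns{ $names[$i] } }, $fields[$i];
        }
    }
    return \%columns;
}

my @lines = (
    "A bunch of junk up here. Paragraph before getting to table.",
    "NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C",
    "1     k      C     0.047     0.240     0.713     ",
    "2     l      C     0.067     0.365     0.568     ",
);

my $cols = parse_columns(@lines);
print "PROB_E: @{ $cols->{PROB_E} }\n";
```

Any column can then be looked up by its header name, however many columns the table has.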

score 1 · Answer 2 · 2012-09-30

1

12.2 years ago

Irsan ★ 7.8k

Or, as suggested, keep it simple:

[your prompt]$ grep '^[0-9]' yourfile.txt | awk '{print $4}'

This prints the fourth column (PROB_H) of the lines in yourfile.txt that start with a digit.

ADD COMMENT • link 12.2 years ago by Irsan ★ 7.8k

score 0 · Answer 3 · 2012-09-27

0

12.3 years ago

Eric ▴ 40

There are a couple of problems with your script. It isn't necessary to read the file into an array, as you are going to iterate over the file line by line. Worse, my @data = <FILE> consumes the whole file, so the while loop that follows has nothing left to read, which is why nothing is printed. Also, your test for throwing out non-matching lines only skips a line starting with "No." and lets the other non-table lines in the file through.

use strict;
use warnings;

open(FILE, "file_data.txt") or die "Cannot open file: $!";

while (my $line = <FILE>) {
    # Unless the line begins with a number followed by
    # one or more whitespace characters, skip it.
    unless ($line =~ m/^\d+\s+/) { next; }
    chomp $line;
    my ($NO, $RES, $DSC_SEC, $PROB_H, $PROB_E, $PROB_C) = split(/\s+/, $line);
    print "$PROB_H\n";
}

close(FILE);

ADD COMMENT • link 12.3 years ago by Eric ▴ 40

0

There is a line in the paragraph that begins with a number. Can I exclude it because it contains certain words?

ADD REPLY • link 12.3 years ago by shane.neeley ▴ 50

0

For example, I changed the unless clause to: unless ($line =~ m/^\d+\s+|residues/) {next;}
because the line that starts with a number has the word "residues" in it. This does not work for some reason.

ADD REPLY • link 12.3 years ago by shane.neeley ▴ 50
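The likely reason is that | means "or". The pattern /^\d+\s+|residues/ matches a line that starts with a number or contains "residues", so the unwanted line passes the test on both counts and is kept. The two conditions need to be checked separately: keep a line only if it starts with a number and does not contain the word. A minimal sketch, with an invented sub name is_table_row and an invented paragraph line:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 for table rows, 0 for everything else.
sub is_table_row {
    my ($line) = @_;
    return 0 unless $line =~ /^\d+\s+/;    # must start with a number...
    return 0 if $line =~ /residues/;       # ...and must not mention "residues"
    return 1;
}

my @lines = (
    "20 residues were predicted in this run.",    # invented paragraph line
    "1     k      C     0.047     0.240     0.713     ",
    "2     l      C     0.067     0.365     0.568     ",
);

for my $line (@lines) {
    next unless is_table_row($line);
    my @fields = split /\s+/, $line;
    print "$fields[3]\n";    # PROB_H
}
```

Inside the original loop the same logic can be written as two statements: next unless $line =~ /^\d+\s+/; followed by next if $line =~ /residues/;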