Help Me Finish This Perl Code To Extract A Column In A Table
12.2 years ago
shane.neeley ▴ 50

Hi, I have a question similar to this one:

http://www.biostars.org/post/show/50142/any-modules-available-to-parse-this-file/#50156

I adapted my code from JC's answer in that post. Thanks JC.

Here is an example of the data file I am opening and trying to read the columns of. The values are separated by runs of spaces.

A bunch of junk up here. Paragraph before getting to table.

NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C
1     k      C     0.047     0.240     0.713     
2     l      C     0.067     0.365     0.568     
3     n      C     0.067     0.365     0.568     
4     f      E     0.045     0.613     0.342     
...

Here is the code I have tried, which doesn't print anything. I want to be able to gather the data from PROB_H, PROB_E, PROB_C and have them in separate lists so that I can do stuff like take the averages of them.

use strict;
use warnings;

open(FILE, "file_data.txt") or die "Cannot open file: $!";

my @data = <FILE>;

while (<FILE>) {
    next if m/^No./;
    chomp;
    my ($NO, $RES, $DSC_SEC, $PROB_H, $PROB_E, $PROB_C) = split(/\s+/, @data);
    print "$PROB_H";
}

close(FILE);
perl data extraction • 10k views

Why would I be downvoted?


Some people are harsh :) Someone probably thought this was a rather basic Perl programming question, as opposed to a bioinformatics research question.


Two obvious errors straight off: (1) you have not escaped the period in your regular expression, so it matches any single character rather than a literal "."; (2) your data contains a line starting with NO (all upper-case), but your regular expression is case-sensitive and looks for lines starting with No (lower-case "o").
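
To make the two points concrete, here is a minimal sketch that compares the original pattern with a corrected one. The second sample line and the helper name skipped_by are made up for illustration; only the header line comes from the question.

```perl
use strict;
use warnings;

# Sample lines: the table header from the question plus a hypothetical
# paragraph line that happens to start with "No".
my @lines = (
    "NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C",
    "Notes on the run",
);

# Hypothetical helper: returns the lines that a given pattern would skip.
sub skipped_by {
    my ($pattern, @candidates) = @_;
    return grep { $_ =~ $pattern } @candidates;
}

my @original = skipped_by(qr/^No./,  @lines);   # case-sensitive, unescaped dot
my @fixed    = skipped_by(qr/^NO\./, @lines);   # literal "NO." at line start

print "original pattern skips: @original\n";
print "fixed pattern skips: @fixed\n";
```

With these two lines, the original pattern skips the paragraph line and lets the header through, while the corrected pattern skips only the header.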


Basically, you want to implement the 'cut' unix command in Perl? Specifically, something like tail -n +2 | cut -c18-26?

12.2 years ago

It would probably be better to ask this question on Stack Overflow.

Without a file it is kinda difficult to debug, but this may do the trick. You could also just use grep | awk ....

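#!/usr/bin/perl
use strict;
use warnings;

# Three-argument open with a lexical filehandle; the mode '<' must be quoted.
open(my $FH, '<', "file_data.txt") or die "Cannot open file: $!";
LINE: while (my $line = <$FH>) {
    chomp $line;
    # Skip every line that does not start with a digit (header, paragraph text).
    next LINE unless $line =~ /^[0-9]/;
    my ($NO, $RES, $DSC_SEC, $PROB_H, $PROB_E, $PROB_C) = split /\s+/, $line;
    print "$PROB_H\n";
}
close($FH);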
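
The question also asks for PROB_H, PROB_E and PROB_C in separate lists so that averages can be taken. A minimal sketch of one way to do that follows. It reads the four sample rows from the question out of an in-memory string instead of file_data.txt so that it runs on its own; the helper name mean and the array names are assumptions, not part of the original script.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# The sample rows from the question, used in place of file_data.txt.
my $text = <<'END';
A bunch of junk up here. Paragraph before getting to table.

NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C
1     k      C     0.047     0.240     0.713
2     l      C     0.067     0.365     0.568
3     n      C     0.067     0.365     0.568
4     f      E     0.045     0.613     0.342
END

# Hypothetical helper: arithmetic mean of a list, or undef if it is empty.
sub mean {
    my @values = @_;
    return @values ? sum(@values) / @values : undef;
}

my (@prob_h, @prob_e, @prob_c);

# Opening a reference to a scalar reads the string as if it were a file.
open(my $FH, '<', \$text) or die "Cannot open in-memory data: $!";
while (my $line = <$FH>) {
    chomp $line;
    next unless $line =~ /^[0-9]/;    # keep only rows that start with a digit
    my ($no, $res, $dsc_sec, $h, $e, $c) = split /\s+/, $line;
    push @prob_h, $h;
    push @prob_e, $e;
    push @prob_c, $c;
}
close($FH);

printf "PROB_H mean: %.4f\n", mean(@prob_h);
printf "PROB_E mean: %.4f\n", mean(@prob_e);
printf "PROB_C mean: %.4f\n", mean(@prob_c);
```

To run it against the real file, replace \$text in the open call with "file_data.txt".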

Let me know if it works.


That gives me my column, thanks. I'm new to perl, what is the function of LINE: and $_ in this?


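You can name (label) your loops in Perl. LINE: is such a label, and next LINE jumps to the next iteration of the loop carrying that label, which helps keep track of things when loops are nested. $_ is Perl's default variable. In your original script, while (<FILE>) assigned each line to it.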
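
A small sketch of both features, using a made-up two-row list rather than the real file (the label ROW and the array names are hypothetical). A for loop without an explicit loop variable puts each element in $_, much as while (<FILE>) does with each line.

```perl
use strict;
use warnings;

# Hypothetical two-row table, held in memory for the demonstration.
my @rows = ("1 k C", "2 l E");
my @seen;

# The outer loop has no loop variable, so each row lands in $_.
# The label ROW lets the inner loop say which loop "next" applies to.
ROW: for (@rows) {
    for my $field (split /\s+/, $_) {
        next ROW if $field eq 'E';    # abandon this row, go to the next one
        push @seen, $field;
    }
}

print "@seen\n";
```

Without the label, next would only advance the inner loop over fields. With it, the rest of the row is skipped as soon as the field E is found.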


Also, this unless it matches [0-9]?


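The ^ anchors the match to the start of the line and [0-9] matches a single digit, so the loop skips every line that does not start with a digit.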


It also ends up printing one of the words up in the paragraph. Can we make it start on the header of the table, and grab the numbers below the header, like I tried before?
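
One way to do this is to ignore everything until the header row has been seen, and only then accept rows that start with a number. The sketch below assumes the header always starts with "NO." as in the sample. The paragraph line beginning with "20" and the flag name $in_table are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: a paragraph containing a line that starts with a
# number, followed by part of the table from the question.
my $text = <<'END';
A bunch of junk up here.
20 residues were predicted in this run.

NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C
1     k      C     0.047     0.240     0.713
2     l      C     0.067     0.365     0.568
END

my @prob_h;
my $in_table = 0;    # becomes true once the header line has been read

open(my $FH, '<', \$text) or die "Cannot open in-memory data: $!";
while (my $line = <$FH>) {
    chomp $line;
    if ($line =~ /^NO\./) {    # the header row marks the start of the table
        $in_table = 1;
        next;
    }
    # Before the header nothing is accepted; after it, only numbered rows.
    next unless $in_table && $line =~ /^\d+\s+/;
    my @fields = split /\s+/, $line;
    push @prob_h, $fields[3];    # PROB_H is the fourth column
}
close($FH);

print "@prob_h\n";
```

The line starting with "20" is skipped because it comes before the header, even though it begins with a number.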


And if I have a lot of columns?
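
With many columns, naming one scalar per column gets unwieldy. A possible alternative, sketched here under the assumption that the header row starts with "NO." and has one name per column, is to read the names from the header and store each column in a hash of arrays. The names @names and %column are hypothetical.

```perl
use strict;
use warnings;

# Part of the table from the question, held in memory for the demonstration.
my $text = <<'END';
NO.  RES   DSC_SEC PROB_H    PROB_E    PROB_C
1     k      C     0.047     0.240     0.713
4     f      E     0.045     0.613     0.342
END

my (@names, %column);

open(my $FH, '<', \$text) or die "Cannot open in-memory data: $!";
while (my $line = <$FH>) {
    chomp $line;
    my @fields = split ' ', $line;    # split on whitespace, ignoring leading blanks
    next unless @fields;              # skip empty lines
    if (!@names) {
        next unless $fields[0] eq 'NO.';    # wait for the header row
        @names = @fields;                   # column names come from the header
        next;
    }
    # Accept only numbered rows with one value per column name.
    next unless $fields[0] =~ /^\d+$/ && @fields == @names;
    for my $i (0 .. $#names) {
        push @{ $column{ $names[$i] } }, $fields[$i];
    }
}
close($FH);

print "PROB_E: @{ $column{PROB_E} }\n";
```

Any column can then be looked up by its header name, for example $column{PROB_C}, however many columns the table has.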

12.2 years ago
Irsan ★ 7.8k

Or as suggested keep it simple

[your prompt]$ grep '^[0-9]' yourfile.txt | awk '{print $4}'

to print out the fourth column of lines in yourfile.txt that start with a number

12.2 years ago
Eric ▴ 40

There are a couple of problems with your script. It isn't necessary to read the file into an array, as you are going to iterate over the file line by line. In fact, my @data = <FILE> consumes the whole file, so the while (<FILE>) loop that follows has nothing left to read, which is why nothing is printed. Also, your test for throwing out non-matching lines only matches a line starting with "No." and not the other non-matching lines in the file.

    use strict;
    use warnings;

    open(FILE, "file_data.txt") or die "Cannot open file: $!";

    while (my $line = <FILE>) {
        # Unless the line begins with a number followed by
        # one or more whitespace characters, skip it.
        unless ($line =~ m/^\d+\s+/) { next; }
        chomp $line;
        my ($NO, $RES, $DSC_SEC, $PROB_H, $PROB_E, $PROB_C) = split(/\s+/, $line);
        print "$PROB_H\n";
    }

    close(FILE);

There is a line in the paragraph that begins with a number. Can I exclude it because it contains certain words?


For example, I changed the unless clause to: unless ($line =~ m/^\d+\s+|residues/) {next;}
because the line that starts with a number has the word residues in it. This does not work for some reason.
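
A likely reason is that | has the lowest precedence in a regular expression, so the pattern reads as "starts with a number OR contains residues". The unwanted line satisfies both halves and is therefore kept. A minimal sketch of one fix is to test the two conditions separately. The sample paragraph line and the helper name is_table_row are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical lines: a paragraph line that starts with a number, and one
# genuine table row from the question.
my @lines = (
    "20 residues were predicted in this run.",
    "1     k      C     0.047     0.240     0.713",
);

# Hypothetical helper: a table row must start with a number AND must not
# contain the word "residues".
sub is_table_row {
    my ($line) = @_;
    return 0 unless $line =~ /^\d+\s+/;
    return 0 if $line =~ /residues/;
    return 1;
}

my @kept = grep { is_table_row($_) } @lines;
print "$_\n" for @kept;
```

Inside the while loop of the answer above, the same logic would be two statements: one next unless for the leading number and one next if for the word residues.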
