Question

Extracting Columns From A CSV File And Putting Them In A Perl Array

12.2 years ago

rosaak ▴ 20

Hi guys, I need help parsing a CSV/TSV file. I would like to get each column into an array, and the name of each array should correspond to that column's header in the file. I am working in Perl, so suggestions in Perl are preferred.

Tab-separated file:

A    B    C    D    E
227844250    -    38234815    -    -
227824251    25029365    38234816    -    19554132
227344253    25029367    38234818    237786606    19554134
227834254    25029368    38234819    237786608    19554167
227834257    25029370    38234822    237784891    23309001
227834259    25029372    38234823    237786615    19554173
-    25027524    -    -    19552119

Expected result: I would like to get each column in an array, with the array name taken from the first line [header] and the header itself not included in the array.

@A= (227844250 227824251 227344253 227834254 227834257 227834259 -)

@B= (- 25029365 25029367 25029368 25029370 25029372 25027524)

perl bioperl • 25k views

updated 3.9 years ago by terdon ▴ 430 • written 12.2 years ago by rosaak ▴ 20

score 5 · Answer 1 · 2013-01-22

I would use a hash of lists instead of a list of lists, but it's up to you. Something like this should work:

#!/usr/bin/perl
use strict;
use warnings;

my %data;  ## This will be a hash of lists, holding the data
my @names; ## This will hold the names of the columns
while (<>) {
    chomp;
    ## Collect the elements of this line (-1 keeps trailing empty fields)
    my @list = split(/\t/, $_, -1);
    ## If this is the 1st line, collect the names
    if ($. == 1) {
        @names = @list;
        $data{$_} = [] for @names; ## Start every column as an empty list
        next;
    }
    ## If it is not the 1st line, collect the data
    for my $i (0 .. $#list) {
        push @{ $data{ $names[$i] } }, $list[$i];
    }
}
foreach (@names) {
    local $" = "\t"; ## print tab separated lists
    print "$_\t@{$data{$_}}\n";
}

If you save the script above as parse.pl and run it on your data, it will print:

$ perl parse.pl data.txt
A    227844250    227824251    227344253    227834254    227834257    227834259    -
B    -    25029365    25029367    25029368    25029370    25029372    25027524
C    38234815    38234816    38234818    38234819    38234822    38234823    -
D    -    -    237786606    237786608    237784891    237786615    -
E    -    19554132    19554134    19554167    23309001    19554173    19552119

Each of your columns can be accessed by name in the script. For example, the column "B" is @{$data{B}}.
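As a minimal sketch of how the hash of lists might be used once it is filled, the snippet below reads the first three rows of the sample data from an in-memory string (only so that it is self-contained) and then pulls out column B by name. The helper name read_columns and the choice to treat "-" as a missing value are assumptions made for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical helper: read tab-separated text from a filehandle and
## return (\%data, \@names), where %data maps column name => array ref.
sub read_columns {
    my ($fh) = @_;
    my (%data, @names);
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line, -1;   ## -1 keeps trailing empty fields
        if (!@names) {                        ## first line: column names
            @names = @fields;
            $data{$_} = [] for @names;
            next;
        }
        ## Append each field to the list for its column
        push @{ $data{ $names[$_] } }, $fields[$_] for 0 .. $#fields;
    }
    return (\%data, \@names);
}

## First three rows of the sample data, held in a string for the demo
my $tsv = join "\n",
    join("\t", qw(A B C D E)),
    join("\t", qw(227844250 - 38234815 - -)),
    join("\t", qw(227824251 25029365 38234816 - 19554132)),
    join("\t", qw(227344253 25029367 38234818 237786606 19554134)),
    "";

open my $fh, '<', \$tsv or die "Cannot open in-memory file: $!";
my ($data, $names) = read_columns($fh);
close $fh;

my @B      = @{ $data->{B} };         ## whole column, "-" included
my @B_real = grep { $_ ne '-' } @B;   ## assumption: "-" marks a missing value

print "B has ", scalar(@B), " rows, ", scalar(@B_real), " with values\n";
print "Second value of column A: $data->{A}[1]\n";
```

Copying a column into a plain array such as @B is probably the closest safe equivalent to the named arrays asked for in the question, without creating variables whose names come from the file.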

score 4 · Answer 2 · 2013-01-22

Just for fun, you can get the same output as that given by @terdon using an R one-liner:

$ Rscript --vanilla -e "write.table(t(read.delim('file.tsv')), col.names=F, quote=F, sep='\t')"
A    227844250    227824251    227344253    227834254    227834257    227834259    -
B    -    25029365    25029367    25029368    25029370    25029372    25027524
C    38234815    38234816    38234818    38234819    38234822    38234823    -
D    -    -    237786606    237786608    237784891    237786615    -
E    -    19554132    19554134    19554167    23309001    19554173    19552119

t() is a handy function for this kind of stuff. I realise a Perl array output is what you want; I just thought I'd share this anyway.

score 0 · Answer 3 · 2013-01-22

12.2 years ago

Gabriel R. ★ 2.9k

Use an array of arrays: split each line, then push each field onto the array for its column.
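The suggestion above could be sketched roughly as follows. The sample rows are embedded in the script so that it runs on its own, and the %index_of lookup from header name to column position is an addition for convenience rather than something the answer spelled out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Sample rows held in the script itself (assumption: real input would
## come from a file); fields are separated by tabs.
my @lines = (
    join("\t", qw(A B C)),
    join("\t", qw(227844250 - 38234815)),
    join("\t", qw(227824251 25029365 38234816)),
);

my @header  = split /\t/, shift @lines;   ## first line: column names
my @columns = map { [] } @header;         ## one empty array per column

for my $line (@lines) {
    my @fields = split /\t/, $line, -1;   ## -1 keeps trailing empty fields
    for my $i (0 .. $#header) {
        ## Short rows get an empty string so all columns stay the same length
        push @{ $columns[$i] }, $fields[$i] // '';
    }
}

## Map each header name to its column index so columns can be found by name
my %index_of = map { $header[$_] => $_ } 0 .. $#header;

print "Column B: @{ $columns[ $index_of{B} ] }\n";
```

An array of arrays keeps the columns in file order by position; the extra lookup hash is only needed when columns are requested by name.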

score 0 · Answer 4 · 2013-01-22

12.2 years ago

macmath ▴ 170

http://www.tizag.com/perlT/perlarrays.php

score 0 · Answer 5 · 2013-01-23

12.2 years ago

rosaak ▴ 20

Thanks a lot for the suggestions :)