Extracting Columns From A CSV File And Putting Them In A Perl Array
11.9 years ago
rosaak ▴ 20

Hi guys, I need help parsing a CSV/TSV file. I would like to get each column into an array, with the name of the array corresponding to the column's header in the file. I am working in Perl, so suggestions in Perl are preferred.


Tab separated file

A          B         C         D          E
227844250  -         38234815  -          -
227824251  25029365  38234816  -          19554132
227344253  25029367  38234818  237786606  19554134
227834254  25029368  38234819  237786608  19554167
227834257  25029370  38234822  237784891  23309001
227834259  25029372  38234823  237786615  19554173
-          25027524  -         -          19552119

Expected result: I would like to get each column in an array whose name is taken from the first line (the header), with the header itself not included in the array.

@A= (227834250 227834251 227834253 227834254 227834257 227834259 -)

@B= (- 25029365 25029367 25029368 25029370 25029372 25027524)

11.9 years ago
terdon ▴ 430

I would use a hash of lists instead of a list of lists, but it's up to you. Something like this should work:

#!/usr/bin/perl
use strict;
use warnings;

my %data; ## This will be a hash of lists, holding the data
my @names; ## This will hold the names of the columns
while (<>) {
    chomp;
    my @list=split(/\t/); ## Collect the elements of this line
    for (my $i=0; $i<=$#list; $i++) {
        ## If this is the 1st line, collect the names
        if ($.==1) {
            $names[$i]=$list[$i];
        }
        ## If it is not the 1st line, collect the data
        else {
            push @{$data{$names[$i]}}, $list[$i];
        }
    }
}
foreach (@names){
    local $"="\t"; ## print tab separated lists
    print "$_\t@{$data{$_}}\n";
}

If you save the script above as parse.pl and run it on your data, it will print:

$ perl parse.pl data.txt
A    227844250    227824251    227344253    227834254    227834257    227834259    -
B    -    25029365    25029367    25029368    25029370    25029372    25027524
C    38234815    38234816    38234818    38234819    38234822    38234823    -
D    -    -    237786606    237786608    237784891    237786615    -
E    -    19554132    19554134    19554167    23309001    19554173    19552119

Each of your columns can be accessed by name in the script. For example, the column "B" is @{$data{B}}.
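
As a minimal sketch of how the hash of lists can be used once it is filled, the snippet below wraps the same parsing idea in a subroutine and reads from an in-memory filehandle. The subroutine name `read_columns`, the sample data, and the choice to store a missing trailing field as `-` are assumptions made for illustration, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read tab-separated text from a filehandle and return
# a reference to the header names plus a reference to a hash of lists.
sub read_columns {
    my ($fh) = @_;
    my (@names, %data);
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;          # skip empty lines
        my @fields = split /\t/, $line;
        if (!@names) {                     # first non-empty line is the header
            @names = @fields;
            next;
        }
        for my $i (0 .. $#names) {
            # Assumption: a missing trailing field is stored as '-'
            push @{ $data{ $names[$i] } }, $fields[$i] // '-';
        }
    }
    return (\@names, \%data);
}

# Sample input (an assumption: the first three columns of two data rows)
my $tsv = "A\tB\tC\n227844250\t-\t38234815\n227824251\t25029365\t38234816\n";
open my $fh, '<', \$tsv or die "Cannot open in-memory handle: $!";
my ($names, $data) = read_columns($fh);
close $fh;

my @B = @{ $data->{B} };                   # copy column B into a plain array
print "B has ", scalar(@B), " values\n";
print "First value of C: $data->{C}[0]\n";
```

Copying a column into a named array, as with `@B` here, gets close to what the question asks for without creating variables whose names come from the file, which would require symbolic references.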

11.9 years ago
Ben ★ 2.0k

Just for fun, you can get the same output as that given by @terdon using an R one-liner:

> Rscript --vanilla -e "write.table(t(read.delim('file.tsv')), col.names=F, quote=F, sep='\\\t')"
A    227844250    227824251    227344253    227834254    227834257    227834259    -
B    -    25029365    25029367    25029368    25029370    25029372    25027524
C    38234815    38234816    38234818    38234819    38234822    38234823    -
D    -    -    237786606    237786608    237784891    237786615    -
E    -    19554132    19554134    19554167    23309001    19554173    19552119

t() is a handy function for this kind of thing. I realise a Perl array is what you want; I just thought I'd share this anyway.


I found that useful, thanks. Is there maybe one extra backslash for the separator type?

11.9 years ago
Gabriel R. ★ 2.9k

Use an array of arrays: split each line and push each field onto the array for its column.
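
A minimal sketch of that suggestion, assuming the first line is the header and the fields are tab-separated. Each column becomes one inner array, and a separate hash maps header names to column positions. The names `@columns` and `%index` and the sample rows are chosen for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows (an assumption: the first three columns of two data rows)
my @lines = (
    "A\tB\tC",
    "227844250\t-\t38234815",
    "227824251\t25029365\t38234816",
);

my @header = split /\t/, shift @lines;    # first line holds the column names
my @columns;                              # array of arrays: $columns[$col][$row]

for my $line (@lines) {
    my @fields = split /\t/, $line;
    for my $col (0 .. $#header) {
        push @{ $columns[$col] }, $fields[$col];
    }
}

# Map each header name to its column position so a column can be found by name
my %index;
@index{@header} = (0 .. $#header);

my @B = @{ $columns[ $index{B} ] };
print "Column B: @B\n";
```

Compared with a hash of lists, this keeps the original column order for free, at the cost of one extra lookup when a column is requested by name.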

11.9 years ago
rosaak ▴ 20

thanks a lot for the suggestions :)
