What is the best (most efficient) way to concatenate multiple CSV files into one file using Perl? The files contain frequency distributions for EnsEMBL sequence data, and I aim to pass the merged CSV file to R for plotting.
The input CSV files have two columns, for size and frequency, e.g.
name1.size,name1.frequency
1,10
2,30
3,20
4,70
5,500
I need to concatenate these into one CSV file so that all the columns from all the files are maintained. However, the columns can differ in length. For example, I require the following output:
name1.size,name1.frequency,name2.size,name2.frequency,name3.size,name3.frequency
1,10,1,20,1,30
2,30,2,10,2,50
3,20,4,70,3,10
4,70,,,5,300
5,500
I'm struggling to think of the best solution to achieve this. I was thinking perhaps a hash of arrays? I've been using Text::CSV_XS, but I can't see a way to write columns, only rows?
use Text::CSV_XS;
use File::Spec;

my %csv_hash;
for my $organism (@organisms) {
    # set up the CSV parser
    my $csv = Text::CSV_XS->new({ binary => 1 });
    # open the input file for this organism
    my $in_file = File::Spec->catfile($path, $organism, $feature . "_freqs.csv");
    open my $fh, '<', $in_file or die "$in_file: $!";
    # traverse the file and push each row onto an array ref
    my $rows = [];
    while (my $row = $csv->getline($fh)) {
        push @$rows, $row;
    }
    $csv_hash{$organism} = $rows;
    # check we reached EOF cleanly, then close the file
    $csv->eof or $csv->error_diag;
    close $fh or die "$in_file: $!";
}
I'm doing the above at the moment, and then thought about iterating over the hash keys and writing out the columns. Any other ideas?
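As far as I can tell, Text::CSV_XS only writes row by row (via print()), so "writing columns" would really mean composing each output line from the i-th element of every stored column. A trivial illustration of the writer, with a made-up file name and made-up values:
my $csv_out = Text::CSV_XS->new({ binary => 1, eol => "\n" });
open my $out, '>', 'combined.csv' or die "combined.csv: $!";
# a made-up header row and one made-up data row spanning two organisms
$csv_out->print($out, ['name1.size', 'name1.frequency', 'name2.size', 'name2.frequency']);
$csv_out->print($out, [1, 10, 1, 20]);
close $out or die "combined.csv: $!";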
Update:
I'm doing this to order the hash contents by array size (descending):
# sort the hash keys by the size of each array ref, in descending order
my @order;
foreach my $k (sort { scalar(@{$csv_hash{$b}}) <=> scalar(@{$csv_hash{$a}}) } keys %csv_hash) {
    push @order, $k;
}
I could then iterate through and build each line of the CSV from the csv_hash?
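Something like this rough, untested sketch is what I have in mind, using @order and %csv_hash from above (the output file name is just a placeholder): pad with empty cells wherever a column runs out, and write each combined line with Text::CSV_XS.
my $csv_out = Text::CSV_XS->new({ binary => 1, eol => "\n" });
open my $out, '>', 'merged_freqs.csv' or die "merged_freqs.csv: $!";

# the first key in @order has the most rows, so it sets the line count
# (each array already contains the header row read from its input file)
my $max_rows = scalar @{ $csv_hash{ $order[0] } };

for my $i (0 .. $max_rows - 1) {
    my @line;
    for my $organism (@order) {
        my $row = $csv_hash{$organism}[$i];
        # pad with empty cells once this organism's column has run out
        push @line, defined $row ? @$row : ('', '');
    }
    $csv_out->print($out, \@line);
}
close $out or die "merged_freqs.csv: $!";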
Be careful with the word "merge". It implies the need to deal with situations such as redundancy (duplicate rows) or rows that share some kind of key (e.g. row name). It sounds to me as though you just want to concatenate the files (= cbind in R)?
R is much better at handling tabular data anyway. Stick with R for this step.
Yup, there's also smartbind() in the gtools package.
I think perhaps the best way would be to iterate through the hash keys and output to CSV in descending order? This way I will ensure that there are no column alignment issues.
I'm not sure that your question is really related to bioinformatics... Nevertheless, I don't understand why you want to use Perl instead of a simple 'sort | uniq' pipeline?
Hi Pierre, I'm working with frequency distributions from bioinformatics data; I suppose it isn't pure bioinformatics, but it is related!
I'm open to suggestions! I need to merge the CSV files to pass them to R for analysis, so whichever works best?
Updated the question with that information!
Yes, I had thought about this, but if I have 100 CSV files and need to share this data, or perform statistical analysis on the data as a whole, then it becomes more difficult!
Actually, I think you might be right... I can pass the files to R, get it to use cbind and then write a CSV from the columns? Then pull the filename for that back to Perl!
This looks perfect: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/gdata/html/cbindX.html
My apologies, I've edited the question to reflect this! cbind, however, requires the columns to be of equal row length. I therefore found cbindX, in the gdata package, which takes columns of unequal row length!