Question

Best Way To Concatenate Columns In Multiple Csv Files Of Differing Row Lengths?

3

Entering edit mode

14.2 years ago

Steve Moss 2.3k

I am trying to think of the best (most efficient) way to concatenate multiple CSV files into one file using Perl? The files contain frequency distributions for EnsEMBL sequence data. I aim to pass the merged CSV file to R for plotting.

The input CSV files have two columns for size and freqency e.g.

name1.size,name1.frequency
1,10
2,30
3,20
4,70
5,500

I need to concatenate these into one CSV file, so all the columns from all the files are maintained. However, the columns can differ in row length. I require the following output for example:

name1.size,name1.frequency,name2.size,name2.frequency,name3.size,name3.frequency
1,10,1,20,1,30
2,30,2,10,2,50
3,20,4,70,3,10
4,70,,,5,300
5,500

I'm struggling to think of the best solution to achieve this. I was thinking perhaps a hash of arrays? I've been using Text::CSV_XS, but I can't see a way to write columns, only rows?

for my $organism (@organisms) {
    # setup CSV
    my $csv = Text::CSV_XS->new ({ binary => 1 });

    # open file
    my $in_file = File::Spec->catfile($path, $organism, $feature . "_freqs.csv");
      open my $fh, "<", "$in_file" or die "$in_file: $!";

    # traverse file and push to 
    my $rows = [];
    while (my $row = $csv->getline($fh)) {
         push(@$rows, $row);
    }

    $csv_hash{$organism} = $rows;

    # close the CSV
    $csv->eof or $csv->error_diag;
    close $fh or die "$in_file: $!";
}

I'm doing the above at the moment and then thought about iterating over the hash keys and writing out the columns? Any other ideas?

Update:

I'm doing this to order by size (descending) of hash contents:

# get the size of the array in each hash array and sort in descending size order
foreach my $k (sort {scalar(@{$csv_hash{$b}}) <=> scalar(@{$csv_hash{$a}})} keys %csv_hash) {
    push(@order, $k);        
}

I could then iterate through and build each line of the CSV from the csv_hash?

perl array parsing • 23k views

ADD COMMENT • link updated 10.5 years ago by pld 5.1k • written 14.2 years ago by Steve Moss 2.3k

2

Entering edit mode

Be careful with the word "merge". It implies the need to deal with situations such as: redundancy (duplicate rows) or rows that share some kind of key (e.g. row name). It sounds to me as though you just want to concatenate the files (= cbind in R) ?

ADD REPLY • link 14.2 years ago by Neilfws 49k

1

Entering edit mode

R is much better at handling tabular data anyway. stick with R for this step.

ADD REPLY • link 14.2 years ago by Jeremy Leipzig 23k

1

Entering edit mode

Yup, there's also smartbind() in the gtools package.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by Neilfws 49k

0

Entering edit mode

I think perhaps the best way would be to iterate through the hash keys and output to CSV in descending order? This way I will ensure that there are no column alignment issues?

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

I'm not sure that your question is really related to bioinformatics.... Nevertheless, I don't understand why you want to use perl instead of a simple 'sort| uniq' pipeline ?

ADD REPLY • link 14.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Pierre, I'm working with frequency distributions from bioinformatics data, although I suppose it isn't pure bioinformatics, it is related!

I'm open to suggestions! I need to merge the CSV files to pass them to R for analysis, so whichever works best?

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

Updated the question with that information!

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

Yes, I had thought about this, but then if I have 100 CSV files and I need to share this data, or perform statistical analysis on the data as a whole, then it makes it more difficult!

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

Actually, I think you might be right... I can pass the files to R, get it to use cbind and then write a CSV from the columns? Then pull the filename for that back to Perl!

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

This looks perfect http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/gdata/html/cbindX.html

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

My apologise, edited the content to reflect this! cbind however, requires the columns to be of equal row length. I there found a package called cbindX that takes columns of unequal row length!

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

My apologies, edited the content to reflect this! cbind however, requires the columns to be of equal row length. I there found a package called cbindX that takes columns of unequal row length!

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

Ram · Answer 1 · 2011-01-27

7

Entering edit mode

14.2 years ago

Giovanni M Dall'Olio 28k

If you want to merge two files to pass them to R, I suggest you to merge them in R directly.

R is a lot better than perl for handling tabular data: for example, in perl there is not a builtin equivalent of the data.frame object, or of the read.csv functions. The same is valid for python and other languages; the fact is that R is a language for analyzing and visualizing data, while the others are general purpose languages.

In any case, you can easily merge two tabular files in R with the merge function from the base package:

> data1 <- read.csv('file1.txt')
> data2 <- read.csv('file2.txt')
> merged_data <- merge(data1, data2, by.x="column_in_data1", by.y="column_in_data2")

Have a look at ?merge for further information. You can also define other types of JOIN and select only some columns.

An alternative is to upload the data into a database and use a SQL call.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by Giovanni M Dall'Olio 28k

1

Entering edit mode

thanks gawbul! base::merge should handle records with different lenght as well, but if you say that the other is faster, I will try it.

ADD REPLY • link 14.2 years ago by Giovanni M Dall'Olio 28k

1

Entering edit mode

Another alternative is to use libraries like sqldf and use SQL instructions.

ADD REPLY • link 14.2 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

I found a great little library called gdata that has a function called cbindX that merges columns with differing row lengths. It seems to work faster the merge! I'm working on it at the moment!

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

Link here: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/gdata/html/cbindX.html

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

Ram · Answer 2 · 2011-01-27

3

Entering edit mode

14.2 years ago

brentp 24k

If you sort your files, you can join them using the linux command join.

Given files a.txt:

a 10
b 22
d 33
f 44

and b.txt:

a 10
b 22
c 33
e 44

where the first column is the id to join on, you can do:

$ join -a 1 -a 2 -e " " a.txt b.txt

to get:

From there, you can easily add all columns after the first in your language of choice.

NOTE: the files have to be sorted on the column you're joining on.

NOTE: you can adjust the separator with the -e flag.

see the man page

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by brentp 24k

0

Entering edit mode

Hi brentp, thanks for the reply. I would need to maintain the columns in parallel, rather than merge the alpha column in your example. So, the pairs of columns for a.txt and b.txt would be appended horizontally. Your example, is useful however, for something else I am planning to do later in the project.

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

0

Entering edit mode

My apologies, as neilfws pointed out, my question was confusing! I need to concatenate the columns, as opposed to merge.

ADD REPLY • link 14.2 years ago by Steve Moss 2.3k

Ram · Answer 3 · 2011-05-14

0

Entering edit mode

13.9 years ago

Steve Moss 2.3k

I found the R package gdata on CRAN.

This provides the cbindX function.

I can do this:

library(gdata)
data1 <- read.csv('file1.csv')
data2 <- read.csv('file2.csv')
data3 <- read.csv('file3.csv')
concat_data <- cbindX(data1, data2, data3)

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 13.9 years ago by Steve Moss 2.3k

2

Entering edit mode

Even though you think this might be the right answer, it is not advisable to use this function in most cases, instead it's the road into disaster in most cases. It is unsafe because it assumes either the order of rows doesn't matter or the rows left out occur only at the end. However, in bioinformatics, rows often correspond to identifiable entities (aka genes) and are better merged on an identifier column as suggested by Giovanni. In fact, imho, it would be better the cbindX function would not exist at all.

ADD REPLY • link 13.9 years ago by Michael 55k

1

Entering edit mode

Well, you edited your question to fit your answer, that's nice, I would take it back now, but now I can't, unless you edit your question, too. The -1 was because your accepted your own (problematic (or wrong, before edit) answer as pointed out in my first comment prior to your edit which replaced 'merge' with 'concatenate'). So, the -1 was meant as a reward for boldness ;)

ADD REPLY • link 13.9 years ago by Michael 55k

0

Entering edit mode

The order makes no difference in this case, so is not an issue to me! Perhaps this may not be applicable to some others, but is relevant within the context that I require it. Thanks for pointing that out though :-)

ADD REPLY • link 13.9 years ago by Steve Moss 2.3k

0

Entering edit mode

Also, perhaps this was partly confusion on my part with my original question, but I aren't merging columns, and am instead concatenating them!

ADD REPLY • link 13.9 years ago by Steve Moss 2.3k

0

Entering edit mode

Also, merge only takes two data frames at a time!

ADD REPLY • link 13.9 years ago by Steve Moss 2.3k

0

Entering edit mode

Not sure why -1 here, as it is relevant to my question!?

ADD REPLY • link 13.9 years ago by Steve Moss 2.3k