Memory-Efficient Way To Transpose Csv/Text File?
5
0
Entering edit mode
11.1 years ago

I've got a few giant text files I'd like to transpose and was wondering whether there was any simple way of doing this. They are in various formats, but "matrixy" - that is same number of columns, no missing values.

If I do not want to keep the whole thing in memory I can't think of any way to do it without n*k complexity, where n is the number of lines and k is the number of columns. As the files contain floating point values of varying length the lines aren't the same length I don't think I can read line by line by choosing a specific number of bytes/characters.

Would be nice if a standard tool like plink could do it, but it only seems to handle ped files.

Ps. there are so many columns that awk is unlikely to be up to the task - has a max NF of 32k IIRC.

matrix scripting • 11k views
ADD COMMENT
0
Entering edit mode

R has a transpose function, have you tried that? Don't re-invent the wheel if you don't have to.

ADD REPLY
4
Entering edit mode
11.1 years ago

If you want to limit yourself based on I/O, rather than memory, you can do the following (really ugly pseudocode that should give you the general idea):

#split each column into it's own file
for each line in matrix
     array = split line
     for i in 1:length(array)
          print file$i array[$i] . "\t";

#open each file sequentially, output its data as a row in the output file:
foreach file (glob(files))
    open file
    line = getline(file)
    print finalfile line . "\n"

#clean up temp files, etc
ADD COMMENT
1
Entering edit mode

If your matrix has more than 1021 columns on Linux or 253 columns on Mac OS X, you may run out of file handles to allocate to columns. For more details, see results from: https://www.google.com/search?q=file+descriptor+limit

ADD REPLY
0
Entering edit mode

One way to deal with that is to change kernel parameters. Another way would be to operate on smaller groups of columns, always working with fewer columns than available file descriptors. You'd need to walk through the input file however many column-multiples of the file descriptor limit n, appending columns 1..n to the output rows, then columns (n+1)..2n, etc.

ADD REPLY
0
Entering edit mode

Will try this and test against my version. Upvote.

ADD REPLY
3
Entering edit mode
11.1 years ago

Can you use disk as intermediate storage?

If you don't want to keep the entire matrix in (system) memory before writing it out, perhaps you can write placeholder bytes to disk directly, and copy over one cell at a time:

  1. Walk through your input nxm matrix to determine maximum precision.
  2. Write out an intermediate mxn CSV matrix file, containing zero-padded elements to the precision found in step 1.

    This creates an empty CSV matrix with equally-sized cells. In C, you could use one-byte char elements for commas and numbers. This is sized in such a way that you can write a function to directly translate or map a position in your input matrix to a byte offset in the empty CSV matrix.

  3. Walk through the input matrix again, writing cell ij from your input matrix over the cell ji in your empty CSV output matrix.

    In C, use fseek(), for example, to jump to the calculated byte offset for cell ji. Then fwrite() a zero-padded string from the original matrix's cell ij to this stream's offset (leaving commas in place).

  4. Clean things up: Walk through the intermediate CSV matrix one row at a time, stripping zero-pad characters and writing this stripped matrix to the final, transposed output.

The time complexity is O(nm) from walking through your nxm matrix four times. Since cells in the intermediate output matrix are equally spaced, mapping cells from the input to output matrix is O(1). This should be very (system) memory efficient as you're only storing one cell at a time in memory, reading/writing that cell from disk. You could need a fair amount of disk storage, however, for caching the intermediate matrix, depending on the desired precision.

ADD COMMENT
1
Entering edit mode
11.1 years ago

I quiclky wrote a tool to transpose a big text. It uses the class SortingCollection of the picard java API. https://github.com/lindenb/jvarkit/wiki/Biostar84786

First the data are read and inserted in the SortingCollection

                            for(;;)
                                {
                                int c=in.read();
                                if(c=='\n' || c==-1)
                                        {
                                        sorter.add(new Cell(row,col,b));
                                        row++;
                                        col=0;
                                        b.setLength(0);
                                        if(c==-1) break;
                                        if(row%10000==0) LOG.info("row:"+row);
                                        }
                                else if(c==delimiter)
                                        {
                                        sorter.add(new Cell(row,col,b));
                                        b.setLength(0);
                                        col++;
                                        }
                                else
                                        {
                                        b.append((char)c);
                                        }
                                }

then we ouput the data using the new order:

                        long x=0L;
                        for(;;)
                                {

                                if(!iter.hasNext())
                                        {
                                        System.out.println();
                                        break;
                                        }
                                Cell c=iter.next();
                                if(c.col!=curr_col)
                                        {
                                        if(curr_col!=-1L) System.out.println();
                                        x=0L;
                                        curr_col=c.col;
                                        }
                                if(x>0L) System.out.print(DELIM);
                                System.out.print(c.content);
                                x++;
                                }
ADD COMMENT
0
Entering edit mode
11.1 years ago

This seems to work, but feels kinda stupid and is slower than slow:

def transpose_matrix_memory_efficiently(input_file, output_delimiter="\t"):

    with open(input_file) as original_file, open(input_file+"_t","w+") as transposed_file:

        number_of_columns = len(original_file.readline().split())

        original_file.seek(0)

        for current_column in range(number_of_columns):

           temporary_line = []

           for line in original_file:

               value = line.split()[current_column]

           temporary_line.append(value)

           transposed_line = output_delimiter.join(temporary_line) + "\n"

           transposed_file.write(transposed_line)

           original_file.seek(0)
ADD COMMENT
0
Entering edit mode
6.9 years ago
Naren ▴ 1000

Try simple perl script here : https://github.com/9aren/Transpose_Matrix/blob/master/Transpose_tsv.pl
(No idea if it is memory efficient or not)

ADD COMMENT
0
Entering edit mode

That script explicitly reads the entire file into an array up front, so no, it is not memory-efficient and doesn't answer the question.

ADD REPLY

Login before adding your answer.

Traffic: 1629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6