Question

Converting A File With Rows And Columns To Just Columns

10.7 years ago

mphillips6789 ▴ 10

I have a file with entries that look like this:

Pos/Line    p
148    

A    0
C    0
G    0.081985
T    0.918015
207    

A    0.021697
C    0.978303
G    0
T    0

I need to convert this to something that looks like:

Pos    A    C    G    T
148    0    0    0.081985    0.918015
207    0.021697    0.978303    0    0

So, my "Pos" entries are more or less already in a column. However, I need to convert the A, C, G, T rows to columns.

Any help would be appreciated.

ADD COMMENT • link updated 10.7 years ago by wdiwdi ▴ 380 • written 10.7 years ago by mphillips6789 ▴ 10

score 4 · Answer 1 · 2014-03-04

This is a textbook example of one of the reasons Perl was created. If your file is completely regular, you could write a few lines of Perl that loop through the file and do something whenever they encounter a line starting with a number. For instance:

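#!/usr/bin/perl
use strict;
use warnings;

print "Pos\tA\tC\tG\tT\n";              # header line
while (<>) {
    chomp;                               # like chop, but removes only a trailing newline
    if (/^(\d+)/) {                      # line starts with a position number
        my @values = ($1);               # keep just the digits, not trailing whitespace
        <>;                              # discard the blank line that follows
        for (my $i = 0; $i < 4; ++$i) {  # read the A, C, G, T lines
            $_ = <>;
            chomp;
            my ($base, $value) = split;  # split $_ on whitespace
            push(@values, $value);
        }
        print join("\t", @values), "\n";
    }
}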

The code above would work to parse your little snippet, and format it the way you've shown above. But it assumes your file is structured in a completely regular way. If the code above were in a file called parse.pl, and your data was in a file called foo.txt, you would call it like so:

./parse.pl foo.txt

and to dump the results to a new file:

./parse.pl foo.txt > newfile.txt

If you're unfamiliar with Perl, here's what's happening. First, print a header line like the one you have above. Then loop through the file one line at a time: the <> symbols grab a line from the file and place it into a variable called $_, and the trailing newline is cut off the end of that line. The if statement tests whether the line begins with one or more digits (many functions, such as split and pattern matching, operate on $_ implicitly unless another variable is handed to them explicitly). If the line begins with digits, remember the number and start a list of values with it. Grab the next line, which should be empty, and don't save it to anything (thus discarding it). Then set up a loop to process the next four lines: remove the trailing newline, split the line on whitespace, and push the second field (the value) onto the list that was created previously. After four lines, print the contents of the list, joined by tab characters and followed by a newline. Repeat until there are no more lines in the file!
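
For readers more at home in Python, the steps just described can be sketched roughly as follows. This is only an illustration and not part of the original answer: the function name reshape is made up here, and skipping any blank lines (rather than assuming exactly one after each position) is an assumption of this sketch. It still expects every position to be followed by its four base lines.

```python
import io

def reshape(lines):
    """Yield tab-separated rows: a header, then one row per position."""
    yield "Pos\tA\tC\tG\tT"
    it = iter(lines)
    for line in it:
        fields = line.split()
        # A line whose first field is all digits starts a new position.
        if fields and fields[0].isdigit():
            values = [fields[0]]
            # Collect the values from the next four non-blank lines (A, C, G, T).
            while len(values) < 5:
                nxt = next(it, None)
                if nxt is None:
                    break  # input ended early; emit what was collected
                parts = nxt.split()
                if len(parts) >= 2:
                    values.append(parts[1])
            yield "\t".join(values)

# The snippet from the question, used as sample input.
sample = """Pos/Line    p
148

A    0
C    0
G    0.081985
T    0.918015
207

A    0.021697
C    0.978303
G    0
T    0
"""

for row in reshape(io.StringIO(sample)):
    print(row)
```

To run it on a real file instead of the sample, one could pass an open file handle, e.g. reshape(open("foo.txt")).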

There are a variety of ways to solve your problem. An awk solution would also be easy to code. But with a few principles from Perl that could be learned in an afternoon or two, you can reshape your file. (Some gurus might find the code above cringe-worthy, but it gets the job done.)

score 3 · Answer 2 · 2014-03-05

10.7 years ago

wdiwdi ▴ 380

The Perl solution is overkill. This problem can be solved in a more readable fashion with a tiny awk script:

BEGIN      { print("Pos\tA\tC\tG\tT") }
/^[0-9]/   { printf("%s\t", $1) }
/^[ACG]/   { printf("%s\t", $2) }
/^T/       { printf("%s\n", $2) }

Run it as "gawk -f myscript.awk myinfile.txt > myoutfile.txt".
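
Note that the awk script depends on the bases always arriving in the order A, C, G, T, with T closing each row. As a hedged illustration of the same line-by-line idea without that assumption, here is a Python sketch that files each value under its base letter and emits the row when the next position (or the end of input) is reached. The function name rows_by_base and the "NA" placeholder for a missing base are inventions of this sketch, not something from the thread.

```python
BASES = ("A", "C", "G", "T")

def rows_by_base(lines, missing="NA"):
    """Return a list of rows (tuples), header first, one row per position."""
    rows = [("Pos",) + BASES]
    pos, seen = None, {}

    def flush():
        # Emit the row for the current position, if there is one.
        if pos is not None:
            rows.append((pos,) + tuple(seen.get(b, missing) for b in BASES))

    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # blank line
        if fields[0].isdigit():           # new position: finish the previous row
            flush()
            pos, seen = fields[0], {}
        elif fields[0] in BASES and len(fields) >= 2 and pos is not None:
            seen[fields[0]] = fields[1]   # remember the value under its base
    flush()                               # last position in the input
    return rows

# Hypothetical input: bases out of order at 148, and G and T missing at 207.
demo = [
    "Pos/Line    p",
    "148", "",
    "T    0.918015", "G    0.081985", "A    0", "C    0",
    "207", "",
    "A    0.021697", "C    0.978303",
]

for row in rows_by_base(demo):
    print("\t".join(row))
```

Keying on the base letter costs a few more lines than the awk version, but a missing or reordered base then shows up as a visible placeholder instead of a silently shifted column.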

No solution is overkill or readable if one knows of no other solution or language. I think we can safely assume mphillips6789 knows neither awk nor Perl (I did mention awk as a possibility in my response). For the edification of those who know neither, the notion of readability is interesting, and they shouldn't miss the elements the two scripts have in common: using // to specify patterns to match against each line, using {} to hold blocks of code, and referring to values through names that start with $.

ADD REPLY • link 10.7 years ago by seidel 11k

Thank you, problem solved. Looking at both solutions was educational in and of itself.

ADD REPLY • link 10.7 years ago by mphillips6789 ▴ 10