restructure rows and columns in perl or python (Interleaving columns by row pairs)
7.6 years ago
Ana ▴ 200

Hi all, I have a SNPs file (containing 11 million SNPs) that I was using to create a covariance matrix in Bayenv. Each column in this file corresponds to a population and the rows are SNPs, but every SNP has 2 rows (one per allele). It looks like this (2 * nsnps "rows" and npops "columns"):

7        2       2       0        6        2       2
1        0       0       0        0        0       0
0        2       2       0        0        0       0
1        0       0       0        0        0       0

So in the example above I have 7 populations (columns) and 2 SNPs (4 rows, two per SNP). I need to modify the format of this file a bit. In the new file each row should correspond to one SNP, and the number of columns should be twice the number of populations, because each population contributes a pair of numbers, one per allele. So the new file should look like this (nsnps "rows" and 2 * npops "columns"):

7   1   2    0    2   0    0   0    6   0   2   0   2   0
0   1   2    0    2   0    0   0    0   0   0   0   0   0

I have R code that does this manipulation for me, but R seems very slow. Can anyone help me figure out whether there is a way to do it in Perl or Python? I am new to both of them, and I would appreciate any help with this issue. Thanks.

perl rows columns dataframe python • 3.5k views

Can you show your R code and tell the size of your matrix and how much RAM you have available? Transposing a matrix should be quick in R, unless your matrix is too big and you are swapping to disk.


It's not exactly transposing.


You are right, it is nowhere near transposing.


Which is great because there is no need to load the entire matrix.
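A rough sketch of why streaming works here, assuming whitespace-separated columns: each output row depends only on two consecutive input rows, so a generator can hold one SNP at a time. The name `interleave_row_pairs` is made up for illustration, and note that `zip` silently drops a trailing unpaired row.

```python
def interleave_row_pairs(lines):
    """Yield one output row per pair of input rows, alternating their columns."""
    rows = iter(lines)
    # zip(rows, rows) pulls two consecutive items from the same iterator,
    # so only one SNP (two rows) is held in memory at a time.
    for first, second in zip(rows, rows):
        a, b = first.split(), second.split()
        if len(a) != len(b):
            raise ValueError("allele rows have different numbers of columns")
        yield " ".join(x + " " + y for x, y in zip(a, b))


# The four rows from the question (2 SNPs, 7 populations).
example = [
    "7 2 2 0 6 2 2",
    "1 0 0 0 0 0 0",
    "0 2 2 0 0 0 0",
    "1 0 0 0 0 0 0",
]
for row in interleave_row_pairs(example):
    print(row)
```

Because an open file is itself an iterator over lines, the same function should accept a file handle in place of the list.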


Is it just me, or did someone else also not understand the output format? I feel like such a noob and still cannot figure out what the OP wanted. I am glad Wouter figured it out, but I would like to understand what the OP is trying to achieve. It would be nice to learn something new. :)


Ah, now I understand the format and what the OP is trying to achieve. Actually, it is not a complete swap of rows and columns. Could a moderator change the question title? Otherwise it will be misleading.


I changed it to "restructure", can't think of anything more specific.


Yes, it is much better now and readers will not be misled. Thanks, @Wouter. At least other readers will now read the question rather than simply copy the code when they need help from this post.


Interleaving columns by row pairs? Interleaving columns by every two rows?


That's a good one! ;)

7.6 years ago

I think this should do the job:

Save as rearrangingalleles.py and execute as python rearrangingalleles.py myinput.txt > myoutput.txt
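The script that belonged to this answer is not reproduced on this page. What follows is only a minimal sketch of what such a script might look like, not the author's original code. It assumes whitespace-separated columns and an even number of rows, and the function name `rearrange` is hypothetical. It writes to stdout so that the `> myoutput.txt` redirection above works.

```python
import sys


def rearrange(infile, outfile):
    """Read rows two at a time and write them as one interleaved row."""
    while True:
        allele1 = infile.readline().split()
        if not allele1:  # end of file (or a blank line) stops the loop
            break
        allele2 = infile.readline().split()
        if len(allele1) != len(allele2):
            raise ValueError("unpaired or ragged allele rows")
        # Alternate the values: pop1/allele1, pop1/allele2, pop2/allele1, ...
        merged = [value for pair in zip(allele1, allele2) for value in pair]
        outfile.write(" ".join(merged) + "\n")


# Only run when called as: python rearrangingalleles.py myinput.txt
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        rearrange(handle, sys.stdout)
```

Only two rows are in memory at any time, so the size of the input file should not matter much.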


More professional: use .write:

import sys

output = open('output.txt', 'w')
with open(sys.argv[1]) as input:
    while True:
        # first allele row of the SNP; an empty list means end of file
        line1 = input.readline().split()
        if len(line1) == 0:
            break
        # second allele row of the same SNP
        line2 = input.readline().split()
        # pair up the two allele values population by population
        output.write(' '.join([line1[i] + " " + line2[i] for i in range(len(line1))]) + '\n')
print('\n' + '\t' + 'Job completed!')

Save as rearrangingalleles.py and execute as python rearrangingalleles.py myinput.txt


I firmly disagree.
It's a very convenient feature if scripts write to stdout, since you can then use them in pipes. Also, it allows specifying both the output name and output directory.

While talking about more professional:

  • didn't close the output file
  • you should use the 'with' statement for opening files

e.g.:

with open(sys.argv[1]) as input, open('output.txt', 'w') as output:

"Also, it allows specifying both the output name and output directory."

prefix = sys.argv[1].split('.')[0]
output = open(prefix + '_output.txt', 'w')    #you will never have to specify output name or directory

"you should use the 'with' statement for opening files"

??? Why?


It's nice that you defend your opinion, but I would suggest considering you might be wrong.

Your code will crash if the current directory is not writeable for the user, which is not an uncommon situation for directories with (shared) data files. Also, the user doesn't have the freedom to

  • Stream the output to another command on stdin (fundamental concept in unix pipelines)
  • Choose the output name and directory of their choice

With regard to "why use the with statement", see this page of Python for beginners.
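To make both points concrete, here is a small sketch (an illustration only, with a hypothetical helper name `open_output`) of a script that writes to stdout by default, accepts an optional output path, and relies on `with` to close the file even if an exception occurs.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_output(path=None):
    """Yield a writable handle: sys.stdout when no path is given, else a file."""
    if path is None or path == "-":
        yield sys.stdout  # never close stdout; the caller may still need it
    else:
        with open(path, "w") as handle:
            yield handle  # closed automatically when the block exits


# With no path, output goes to stdout and can be piped or redirected.
with open_output() as out:
    out.write("7 1 2 0 2 0\n")
```

Passing a path instead, for example `open_output("results/myoutput.txt")`, would write to a file of the user's choice in a directory of their choice.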


You are a very funny person WouterDeCoster :). I'm glad you're learning to program but, in addition to online courses I recommend using common sense, sometimes it is very useful!

Best.


Due to the limited meta-communication online I'm not sure if you are just trolling me or are genuinely an arrogant fool. Anyway, thanks for the advice.

Have a nice day.


Thanks so much WouterDeCoster, it produced exactly the file I wished in less than 1 minute.


If my answer resolved your question you can mark it as accepted.

