6.6 years ago
jim.paredes
Hello, I have a problem with a .csv file of copy number data. The original looks like this:
genes Log2
PIK3CA,TET2 -0.35
MLH2,NRAS 0.54
And, what I need is:
genes Log2
PIK3CA -0.35
TET2 -0.35
MLH2 0.54
NRAS 0.54
I have tried many things so far, without success. The file was created with CNVkit from gastric cancer samples. The real file is much bigger and the list of genes is longer, but this is essentially what I need to do in order to analyze our CNV data.
I use Linux (Ubuntu 16.04). I would appreciate it if you could help me with an R or Python script, but at this point any solution would be good.
Thank you
Adding to Wouter's comment, please explain whether you've tried awk. Plus, what have you tried using Python/R?
Thanks! I was trying with this mostly:
But it doesn't work well. I also need to repeat the Log2 value for each gene in the row (in the comma-separated list). Would transposing the columns work for this?
@OP: This can be done in awk, especially with the format shown in the original post. Use split and a loop.
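The same split-and-loop idea can be sketched in Python, which the question also asks about. This is a minimal sketch, assuming the simplified layout shown in the question (a comma-separated gene list, whitespace, then the Log2 value); the function name `expand_genes` is made up for illustration, and the sample text is copied from the question rather than from the real file.

```python
import io

def expand_genes(lines):
    """Yield one (gene, log2) pair per gene in each comma-separated list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the Log2 value is the last whitespace-separated field.
        genes, log2 = line.rsplit(None, 1)
        # Split the gene list and repeat the Log2 value for every gene.
        for gene in genes.split(","):
            yield gene.strip(), log2

# Sample data taken from the question; a real file would be opened with open().
sample = io.StringIO("genes Log2\nPIK3CA,TET2 -0.35\nMLH2,NRAS 0.54\n")
header = next(sample).split()  # consume the header line first
rows = list(expand_genes(sample))

print("\t".join(header))
for gene, log2 in rows:
    print(f"{gene}\t{log2}")
```

The Log2 value is kept as text on purpose, so it is written back exactly as it appeared in the input.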
That is not what a CSV file looks like. If it were me, I would do this with a Python script because it's a bit messy.
The format is not really the problem; I can export it to any other format, but the genes column looks the same. How can I do it with Python?
To me it is a problem because you are showing PIK3CA,TET2 as a single column in a CSV, even though there is a comma separating them. But then you also show the columns separated by tabs(?). If I could see the exact structure of the file, I could write up something quick in Python.
Ok, well, the original file is .cns, it is a text file, that looks like this (first line plus header):
We transformed it to a .csv so we could separate it by tab. The .cns is separated by commas, but the genes are a single string delimited by quotes. I hope this is more useful.
Take a look at csvkit - it works well with quoted CSV columns. Or, use R to read the data into a 2D array, then create another 2D array by splitting the 4th column and assigning the 5th column's value to each part of the 4th column.
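For a Python route, the split-the-4th-column idea might look roughly like the sketch below, using only the standard `csv` module, which handles quoted fields that contain commas. The exact header of the file was not shown in this thread, so the column names (`chromosome`, `start`, `end`, `gene`, `log2`) and the coordinates in the sample are assumptions made up for illustration; only the gene names and Log2 values come from the question. The helper name `split_gene_rows` is also hypothetical.

```python
import csv
import io

def split_gene_rows(reader, gene_col="gene"):
    """Yield one copy of each row per gene in its comma-separated gene field."""
    for row in reader:
        for gene in row[gene_col].split(","):
            gene = gene.strip()
            if gene:
                # Copy the row, replacing the gene list with a single gene,
                # so every other column (including log2) is repeated.
                yield {**row, gene_col: gene}

# Hypothetical sample: column names and coordinates are assumptions,
# not taken from the real .cns file.
sample = io.StringIO(
    'chromosome,start,end,gene,log2\n'
    'chr3,100,200,"PIK3CA,TET2",-0.35\n'
    'chr1,300,400,"MLH2,NRAS",0.54\n'
)

reader = csv.DictReader(sample)
expanded = list(split_gene_rows(reader))

# Write the expanded rows back out as CSV (here into a string buffer).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(expanded)
print(out.getvalue())
```

For a real file, `sample` would be replaced by `open(path, newline="")`, and `gene_col` would be set to whatever the gene column is actually called.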