Question

copy number file format issue

0

6.6 years ago

jim.paredes • 0

Hello, I have a problem with a .csv file of copy number data. The original looks like this:

genes               Log2
PIK3CA,TET2          -0.35
MLH2,NRAS            0.54

And, what I need is:

genes                 Log2
PIK3CA              -0.35
TET2                 -0.35
MLH2                0.54
NRAS                 0.54

I have tried many things so far, and none of them have worked. The file was created with CNVkit from gastric cancer samples. The real file is much bigger, and the list of genes is longer, but this is essentially what I need to do in order to analyze our CNV data.

I use Linux (Ubuntu 16.04). I would appreciate it if you could help me with an R or Python script, but at this point any solution would be good.

Thank you

cnv copynumber R python • 2.0k views

ADD COMMENT • link updated 6.6 years ago by cpad0112 21k • written 6.6 years ago by jim.paredes • 0

1

Please remove the text in bold; it does not make sense to have a full post in bold.
Please select a more descriptive title for your thread. You want to reformat a file; the fact that this is copy number analysis data is less relevant in the title.
Explain what you tried and how it didn't work. We'll be more eager to point out your mistake and get you back on track.

ADD REPLY • link 6.6 years ago by WouterDeCoster 47k

0

Adding to Wouter's comment, please explain whether you've tried awk. Also, what have you tried using Python/R?

ADD REPLY • link 6.6 years ago by Ram 44k

0

Thanks! I was trying with this mostly:

awk -F , -v OFS='\t' 'NR == 1 || $0 > 0 {print $4}' AGM3.call.prueba.cns.csv |less

But it doesn't work well. Also, I need to repeat the Log2 value for each gene in the row (in the comma-separated list). Would transposing the columns work for this?

ADD REPLY • link updated 6.6 years ago by GenoMax 148k • written 6.6 years ago by jim.paredes • 0

0

@OP: This can be done in awk, especially with the format shown in the original post. Use split() and a loop.

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

That is not what a CSV file looks like. If it were me, I would do this with a Python script because it's a bit messy.

ADD REPLY • link 6.6 years ago by goodez ▴ 640

0

The format is not really the problem. I can export it to any other format, but the genes column looks the same. How can I do it with Python?

ADD REPLY • link 6.6 years ago by jim.paredes • 0

0

To me it is a problem because you are showing PIK3CA,TET2 as a single column in a CSV, even though there is a comma separating them. But then you also show the columns separated by tabs(?). If I could see the exact structure of the file, I could write up something quick in Python.

ADD REPLY • link 6.6 years ago by goodez ▴ 640

0

OK, the original file is a .cns file, which is a text file that looks like this (header plus first line):

chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067

We converted it to a .csv so we could separate it by tabs. The .cns is separated by commas, but the genes are a single string delimited by quotes. I hope this is more useful.

ADD REPLY • link updated 6.6 years ago by Ram 44k • written 6.6 years ago by jim.paredes • 0

0

Take a look at csvkit, which works well with quoted CSV columns. Or use R to read the data into a 2D array, then create another 2D array by splitting the 4th column and assigning the 5th column value to each part of the 4th column.

ADD REPLY • link 6.6 years ago by Ram 44k
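
Since the original post also asked for Python, here is a minimal sketch of the idea suggested above (split the quoted gene column and repeat the log2 value for each gene), written in Python rather than R. It uses only the standard-library csv module, which parses the quoted gene field as a single column. The function name explode_genes and the inline sample are illustrative assumptions; the column names gene and log2 are taken from the .cns header posted earlier in the thread.

```python
import csv
import io


def explode_genes(rows, gene_col="gene", value_col="log2"):
    """Yield one (gene, value) pair per gene name found in each row.

    `rows` is any iterable of dict-like rows, e.g. a csv.DictReader.
    Duplicate gene names within a row are kept as-is.
    """
    for row in rows:
        for gene in row[gene_col].split(","):
            gene = gene.strip()
            if gene:  # skip empty names caused by stray commas
                yield gene, row[value_col]


# Header and first data line as posted in the thread.
sample = (
    "chromosome,start,end,gene,log2\n"
    'chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,'
    'LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067\n'
)

# csv.DictReader handles the double-quoted gene list as one field.
reader = csv.DictReader(io.StringIO(sample))
pairs = list(explode_genes(reader))

print("gene", "log2", sep="\t")
for gene, log2 in pairs:
    print(gene, log2, sep="\t")
```

To run this on a real file, the io.StringIO(sample) argument could be replaced with an open file handle such as open("sample.cns", newline=""), where sample.cns is a placeholder name. The log2 values are kept as strings so they are written back out exactly as they were read.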

score 1 · Answer 1 · 2018-06-27

1

6.6 years ago

cpad0112 21k

output:

$ awk -v OFS="\t" '{n = split($1, a, ","); for (i = 1; i <= n; i++) print a[i], $2}' test.txt
genes   Log2
PIK3CA  -0.35
TET2    -0.35
MLH2    0.54
NRAS    0.54

input:

$ cat test.txt 
genes   Log2
PIK3CA,TET2 -0.35
MLH2,NRAS   0.54

ADD COMMENT • link 6.6 years ago by cpad0112 21k
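
For readers who prefer Python, the same split-and-loop logic as the awk one-liner can be sketched as follows. It assumes the simplified two-column, whitespace-separated layout of test.txt shown above; the function name split_gene_lines is hypothetical and chosen only for illustration.

```python
def split_gene_lines(lines):
    """Return one tab-separated 'gene<TAB>value' string per gene.

    Each input line is expected to hold a comma-separated gene list
    followed by whitespace and a single value (as in test.txt).
    """
    out = []
    for line in lines:
        fields = line.split()  # split on any run of spaces or tabs
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        genes, value = fields[0], fields[1]
        for gene in genes.split(","):
            out.append(f"{gene}\t{value}")
    return out


# Same content as test.txt in the answer above.
sample = [
    "genes   Log2",
    "PIK3CA,TET2 -0.35",
    "MLH2,NRAS   0.54",
]

result = split_gene_lines(sample)
print("\n".join(result))
```

Like the awk version, this passes the header line through unchanged because it contains no comma. To read from a file instead of the inline sample, pass an open file object, for example split_gene_lines(open("test.txt")).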