copy number file format issue
6.4 years ago

Hello, I have a problem with a .csv file containing copy number data. The original looks like this:

genes               Log2
PIK3CA,TET2          -0.35
MLH2,NRAS            0.54

And, what I need is:

genes               Log2
PIK3CA              -0.35
TET2                -0.35
MLH2                0.54
NRAS                0.54

I have tried many things so far, without success. The file was created with CNVkit from gastric cancer samples. The real file is much bigger and the list of genes is longer, but this is essentially what I need to do in order to analyze our CNV data.

I use Linux (Ubuntu 16.04). I would appreciate it if you could help me with an R or Python script, but at this point any solution would be good.

Thank you

cnv copynumber R python • 1.9k views
  1. Please remove the bold formatting; it does not make sense to have a full post in bold.
  2. Please select a more descriptive title for your thread. You want to reformat a file; the fact that this is copy number analysis data is less relevant in the title.
  3. Explain what you tried and how it didn't work. We'll be more eager to point out your mistake and get you back on track.

Adding to Wouter's comment, please explain whether you've tried awk. Also, what have you tried using Python/R?


Thanks! I was trying with this mostly:

awk -F , -v OFS='\t' 'NR == 1 || $0 > 0 {print $4}' AGM3.call.prueba.cns.csv |less

But it doesn't work well. Also, I need to repeat the Log2 value for each gene in the row (in the comma-separated list). Would transposing the columns work for this?


@OP: This can be done in awk, especially with the format shown in the original post. Use split() and a loop.


That is not what a CSV file looks like. If it were me, I would do this with a Python script, because it's a bit messy.


The format is not really the problem. I can export it to any other format, but the genes column looks the same. How can I do it with Python?


To me it is a problem, because you are showing PIK3CA,TET2 as a single column in a CSV even though there is a comma separating them. But then you also show the columns separated by tabs(?). If I could see the exact structure of the file, I could write up something quick in Python.


OK, well, the original file is a .cns file. It is a text file that looks like this (header plus first line):

chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067

We transformed it to a .csv so we could separate it by tab. The .cns is separated by commas, but the genes are a single string delimited by quotes. I hope this is more useful.
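
For what it's worth, a quoted field like that is standard CSV quoting, so a CSV-aware parser should treat the whole gene list as one field. A minimal sketch with Python's standard `csv` module, using only the header and the single data line quoted above (the variable names are just for illustration):

```python
import csv
import io

# Header plus first data line, copied from the post above.
sample = (
    "chromosome,start,end,gene,log2\n"
    'chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,'
    'LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067\n'
)

# DictReader honours the double quotes, so the gene list stays in one field.
rows = list(csv.DictReader(io.StringIO(sample)))
first = rows[0]

# Only after parsing do we split the gene field on its internal commas.
genes = first["gene"].split(",")

print(list(first.keys()))
print(genes)
print(first["log2"])
```

Splitting the raw line on commas, by contrast, would scatter the gene names across extra columns, which is probably why the earlier awk attempt with `-F ,` misbehaved.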


Take a look at csvkit; it works well with quoted CSV columns. Or use R to read the data into a 2D array, then create another 2D array by splitting the 4th column and assigning the 5th column's value to each part of the 4th column.
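
The same split-and-repeat idea can be sketched in Python with only the standard library. This assumes the column names `gene` and `log2` from the header shown earlier in the thread; the function names, coordinates, and two example rows below are made up for illustration and are not from the real file:

```python
import csv
import io
import sys


def split_gene_rows(handle):
    """Yield one (gene, log2) pair for every gene in each row's gene field."""
    for row in csv.DictReader(handle):
        for gene in row["gene"].split(","):
            yield gene, row["log2"]


def write_table(pairs, out=sys.stdout):
    """Write the pairs as a tab-separated table with a header line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "log2"])
    writer.writerows(pairs)


# Hypothetical input in the .cns layout (coordinates are invented).
example = io.StringIO(
    "chromosome,start,end,gene,log2\n"
    'chr1,100,200,"PIK3CA,TET2",-0.35\n'
    'chr2,300,400,"MLH2,NRAS",0.54\n'
)

pairs = list(split_gene_rows(example))
write_table(pairs)
```

For a real file, you would pass an open file handle (opened with `newline=""`, as the `csv` documentation recommends) instead of the `io.StringIO` object. The log2 values are kept as strings so they are written back exactly as they were read.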

6.4 years ago

output:

$ awk -v OFS="\t" '{n = split($1, a, ","); for (i = 1; i <= n; i++) print a[i], $2}' test.txt 
genes   Log2
PIK3CA  -0.35
TET2    -0.35
MLH2    0.54
NRAS    0.54

input:

$ cat test.txt 
genes   Log2
PIK3CA,TET2 -0.35
MLH2,NRAS   0.54
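
Since the question also asked for Python, here is a rough equivalent of the awk one-liner for the same whitespace-separated, two-column input. It is only a sketch: the function name `expand` is made up, and it assumes every useful line has exactly two whitespace-separated fields.

```python
import io


def expand(lines):
    """Return one 'gene<TAB>value' string per gene, repeating the value."""
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        genes, value = fields
        for gene in genes.split(","):
            out.append(f"{gene}\t{value}")
    return out


# Same content as test.txt above.
test_txt = io.StringIO("genes   Log2\nPIK3CA,TET2 -0.35\nMLH2,NRAS   0.54\n")

result = expand(test_txt)
print("\n".join(result))
```

The header line passes through unchanged because "genes" contains no comma, so it splits into a single item.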