Dear all,
I have a dumb question that bothers me a lot. I have a set of genes with calculated ChIP-seq binding intensities. Notably, most of these genes have more than one value (i.e., because of multiple binding I thought). Now I want to clean these data by calculating the average of biding for each gene. The head of the data likes the following,
A2M -3.63
A2M -18.46
ACE 1.68
ACYP2 8.23
ACYP2 6.09
ADAMTS12 1.77
ADAMTS12 1.65
ADAMTS6 5.36
ADIPOQ 4.06
ADIPOQ 1.71
AGR2 13.71
AGR2 8.92
AGR2 5.05
AGR2 3.02
AGR2 2.23
AGR2 -4.22
I have these data in excel file. Obviously, it is annoying to do manually in Excel because of thousands of genes. How do I do this? I guess a simple script may help. But I have no experience in programming. Your help is much appreciated! Thank you! Xiaoyong Fu Houston, TX
I hate excel but I think i can be easily done with "pivot table" isn't it ? https://support.office.com/en-us/article/Calculate-values-in-a-PivotTable-report-11f41417-da80-435c-a5c6-b0185e59da77
@Pierre recommending using excel for something. Well .. that is a first!
No I didn't ! No I didn't ! That wasn't me !
No worries everyone, I made a screenshot.
You are such a meme person!
Hi Pierre, Thank you so much! I think this is what I need to look for. I will check on it. BTW, is it possible to do so in R? Best, Xiaoyong
Everything can be done in Excel. Unless you go over 64k lines or whatever the latest Microsoft boss thinks is sufficient. But otherwise, everything.
I definitely love the convenience function of converting gene names to dates. Wouldn't know how to work without it!
The only thing you have to do when importing a file is to select the column containing the gene names and set data type to Text.
Yes, that I know. But it's an error which happens easily, silent and irreversible. That shouldn't be possible.
I agree, Excel is a powerful tool to work with. However, it takes some time to really grab its essence and logic. Since Excel 2007 tables can be "1,048,576 rows by 16,384 columns".
Hmm, I wasn't aware of that. But it seems that Excel is incapable of plotting more that 64k lines in a graph. It doesn't really matter to me how many you can put in a table, if you can't show them in a graph; data that can't be visualized may as well not exist.
"1,048,576 rows by 16,384 columns" is still completely arbitrary. And even if you can paste data in those fields, you can't use it in a graph because Microsoft either lacks skilled programmers or managers. There is no sane reason to restrict people to those limits. If I want to paste a billion lines into Excel, I should be able to do so (if I have sufficient RAM), and it baffles me why Microsoft is incapable of developing tools that can handle the needs of anyone outside of pedestrian users. They are stuck in the "640k is enough for anyone" mindset.
I love Excel for copying and pasting data into a sensible matrix so I can then plot it in something pretty (R - ggplot/plotly) ;) also if you're supervisor is among the less tech savvy, they inevitably want all experiments to produce and excel table they can play with
You say some of your genes have multiple values, but your example data doesn't. Can you give us a more representative example input file?
R dplyr:
group_by
->summarise