Question

A possible script request for a task probably cannot be done in Excel

0

Entering edit mode

8.0 years ago

xiaoyonf ▴ 60

Dear all,

I have a dumb question that bothers me a lot. I have a set of genes with calculated ChIP-seq binding intensities. Notably, most of these genes have more than one value (i.e., because of multiple binding I thought). Now I want to clean these data by calculating the average of biding for each gene. The head of the data likes the following,

A2M -3.63
A2M -18.46
ACE 1.68
ACYP2   8.23
ACYP2   6.09
ADAMTS12    1.77
ADAMTS12    1.65
ADAMTS6 5.36
ADIPOQ  4.06
ADIPOQ  1.71
AGR2    13.71
AGR2    8.92
AGR2    5.05
AGR2    3.02
AGR2    2.23
AGR2    -4.22

I have these data in excel file. Obviously, it is annoying to do manually in Excel because of thousands of genes. How do I do this? I guess a simple script may help. But I have no experience in programming. Your help is much appreciated! Thank you! Xiaoyong Fu Houston, TX

excel • 3.3k views

ADD COMMENT • link updated 7.7 years ago by Biostar 20 • written 8.0 years ago by xiaoyonf ▴ 60

3

Entering edit mode

I hate excel but I think i can be easily done with "pivot table" isn't it ? https://support.office.com/en-us/article/Calculate-values-in-a-PivotTable-report-11f41417-da80-435c-a5c6-b0185e59da77

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 165k

4

Entering edit mode

@Pierre recommending using excel for something. Well .. that is a first!

ADD REPLY • link 8.0 years ago by GenoMax 149k

1

Entering edit mode

No I didn't ! No I didn't ! That wasn't me !

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 165k

2

Entering edit mode

No worries everyone, I made a screenshot.

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

0

Entering edit mode

enter image description here

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 165k

0

Entering edit mode

You are such a meme person!

ADD REPLY • link 8.0 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi Pierre, Thank you so much! I think this is what I need to look for. I will check on it. BTW, is it possible to do so in R? Best, Xiaoyong

ADD REPLY • link 8.0 years ago by xiaoyonf ▴ 60

2

Entering edit mode

Everything can be done in Excel. Unless you go over 64k lines or whatever the latest Microsoft boss thinks is sufficient. But otherwise, everything.

ADD REPLY • link 8.0 years ago by Brian Bushnell 20k

7

Entering edit mode

I definitely love the convenience function of converting gene names to dates. Wouldn't know how to work without it!

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

0

Entering edit mode

The only thing you have to do when importing a file is to select the column containing the gene names and set data type to Text.

ADD REPLY • link 8.0 years ago by aquaq ▴ 40

0

Entering edit mode

Yes, that I know. But it's an error which happens easily, silent and irreversible. That shouldn't be possible.

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

0

Entering edit mode

I agree, Excel is a powerful tool to work with. However, it takes some time to really grab its essence and logic. Since Excel 2007 tables can be "1,048,576 rows by 16,384 columns".

ADD REPLY • link 8.0 years ago by aquaq ▴ 40

1

Entering edit mode

Hmm, I wasn't aware of that. But it seems that Excel is incapable of plotting more that 64k lines in a graph. It doesn't really matter to me how many you can put in a table, if you can't show them in a graph; data that can't be visualized may as well not exist.

"1,048,576 rows by 16,384 columns" is still completely arbitrary. And even if you can paste data in those fields, you can't use it in a graph because Microsoft either lacks skilled programmers or managers. There is no sane reason to restrict people to those limits. If I want to paste a billion lines into Excel, I should be able to do so (if I have sufficient RAM), and it baffles me why Microsoft is incapable of developing tools that can handle the needs of anyone outside of pedestrian users. They are stuck in the "640k is enough for anyone" mindset.

ADD REPLY • link 8.0 years ago by Brian Bushnell 20k

0

Entering edit mode

I love Excel for copying and pasting data into a sensible matrix so I can then plot it in something pretty (R - ggplot/plotly) ;) also if you're supervisor is among the less tech savvy, they inevitably want all experiments to produce and excel table they can play with

ADD REPLY • link 8.0 years ago by Joe 22k

0

Entering edit mode

You say some of your genes have multiple values, but your example data doesn't. Can you give us a more representative example input file?

ADD REPLY • link 8.0 years ago by Joe 22k

0

Entering edit mode

R dplyr: group_by -> summarise

ADD REPLY • link 8.0 years ago by shenwei356 8.7k

1

Entering edit mode

8.0 years ago

steve ★ 3.5k

You can also do this easily in R.

# reading in the data; insert your data import method here:
gene_stats <- "gene\tvalue
A2M\t-3.63
A2M\t-18.46
ACE\t1.68
ACYP2\t8.23
ACYP2\t6.09
ADAMTS12\t1.77
ADAMTS12\t1.65
ADAMTS6\t5.36
ADIPOQ\t4.06
ADIPOQ\t1.71
AGR2\t13.71
AGR2\t8.92
AGR2\t5.05
AGR2\t3.02
AGR2\t2.23
AGR2\t-4.22
"
gene_stats_df_df <- read.table(textConnection(gene_stats), header = TRUE, sep = '\t')

# get the mean for each gene in the table
aggregate(value ~ gene, data = gene_stats_df_df, FUN = mean)

output:

      gene   value
1      A2M -11.045
2      ACE   1.680
3    ACYP2   7.160
4 ADAMTS12   1.710
5  ADAMTS6   5.360
6   ADIPOQ   2.885
7     AGR2   4.785

ADD COMMENT • link 8.0 years ago by steve ★ 3.5k

1

Entering edit mode

8.0 years ago

aquaq ▴ 40

You can also copy the first column (A) to another column(D), then apply remove duplicates to this column. Then, in the E1 cell, you only have to use =AVERAGEIF(A:A, D1, B:B). With double click on the lower right corner of the E1 cell, you can copy this function for all the E column.

ADD COMMENT • link 8.0 years ago by aquaq ▴ 40

0

Entering edit mode

Thank you, aquaq. This is I thought the quickest way to solve my problem!

ADD REPLY • link 7.8 years ago by xiaoyonf ▴ 60

1

Entering edit mode

8.0 years ago

arnstrm ★ 1.9k

In Excel, select the table, go to Inset, PivotTable, choose the location (Existing Worksheet and select a blank cell). A side bar section PivotTable Fiedls will appear. Drag the Gene field name to Rows and Value to the values. The default will be the sum of values for each gene. To change it click on the drop down menu and then value field settings and choose average

(suggesting only because Pierre mentioned this as the option :P)

screenshot

ADD COMMENT • link 8.0 years ago by arnstrm ★ 1.9k

0

Entering edit mode

8.0 years ago

theobroma22 ★ 1.2k

X <- aggregate ( x, data, mean)

ADD COMMENT • link 8.0 years ago by theobroma22 ★ 1.2k

score 3 · Accepted Answer · 2017-03-02

3

Entering edit mode

8.0 years ago

GouthamAtla 12k

Assuming the tab delimited file:

sort -k1,1 in.txt | groupBy -g 1 -c 2 -o mean > out.txt

out.txt:

A2M -11.045
ACE 1.68
ACYP2   7.16
ADAMTS12    1.71
ADAMTS6 5.36
ADIPOQ  2.885
AGR2    4.785

groupBy from bedtools

ADD COMMENT • link 8.0 years ago by GouthamAtla 12k

0

Entering edit mode

Hi Goutham, This looks so cool! I will try it definitely. Thank you so much! Xiaoyong

ADD REPLY • link 8.0 years ago by xiaoyonf ▴ 60

0

Entering edit mode

The bedtools groupby command works beautifully to perform such work. Thank you for quick answer! Xiaoyong

ADD REPLY • link 8.0 years ago by xiaoyonf ▴ 60