Question

How to create a summary statistics data table for omics data?

6.7 years ago

WUSCHEL ▴ 860

Hi, I have a big data frame for omics data. Samples are named as Genotype_Time_Replicate (e.g. AOX_1h_4). Each sample has 4 replicates for each time point.

Example data set:

df <- structure(list(AGI = c("ATCG01240", "ATCG01310", "ATMG00070"), aox2_0h__1 = c(15.79105291, 14.82652303, 14.70630068), aox2_0h__2 = c(16.06494674, 14.50610036, 14.52189807), aox2_0h__3 = c(14.64596287, 14.73266459, 13.07143141), aox2_0h__4 = c(15.71713641, 15.15430026, 16.32190068 ), aox2_12h__1 = c(14.99030606, 15.08046949, 15.8317372), aox2_12h__2 = c(15.15569857, 14.98996474, 14.64862254), aox2_12h__3 = c(15.12144791, 14.90111092, 14.59618842), aox2_12h__4 = c(14.25648197, 15.09832061, 14.64442686), aox2_24h__1 = c(15.23997241, 14.80968391, 14.22573239 ), aox2_24h__2 = c(15.57551513, 14.94861669, 15.18808897), aox2_24h__3 = c(15.04928714, 14.83758685, 13.06948037), aox2_24h__4 = c(14.79035385, 14.93873234, 14.70402827), aox5_0h__1 = c(15.8245918, 14.9351844, 14.67678306), aox5_0h__2 = c(15.75108628, 14.85867002, 14.45704948 ), aox5_0h__3 = c(14.36545859, 14.79296855, 14.82177912), aox5_0h__4 = c(14.80626019, 13.43330964, 16.33482718), aox5_12h__1 = c(14.66327372, 15.22571466, 16.17761867), aox5_12h__2 = c(14.58089039, 14.98545497, 14.4331578), aox5_12h__3 = c(14.58091828, 14.86139511, 15.83898617 ), aox5_12h__4 = c(14.48097297, 15.1420725, 13.39369381), aox5_24h__1 = c(15.41855602, 14.9890092, 13.92629626), aox5_24h__2 = c(15.78386057, 15.19372889, 14.63254456), aox5_24h__3 = c(15.55321382, 14.82013321, 15.74324956), aox5_24h__4 = c(14.53085803, 15.12196994, 14.81028556 ), WT_0h__1 = c(14.0535031, 12.45484834, 14.89102226), WT_0h__2 = c(13.64720361, 15.07144643, 14.99836235), WT_0h__3 = c(14.28295759, 13.75283646, 14.98220861), WT_0h__4 = c(14.79637443, 15.1108037, 15.21711524 ), WT_12h__1 = c(15.05711898, 13.33689777, 14.81064042), WT_12h__2 = c(14.83846779, 13.62497318, 14.76356308), WT_12h__3 = c(14.77215863, 14.72814995, 13.0835214), WT_12h__4 = c(14.70685445, 14.98527337, 16.12727292), WT_24h__1 = c(15.43813077, 14.56918572, 14.92146565 ), WT_24h__2 = c(16.05986898, 14.70583866, 15.64566505), WT_24h__3 = c(14.87721853, 13.22461859, 16.34119942), 
WT_24h__4 = c(14.92822133, 14.74382383, 12.79146694)), class = "data.frame", row.names = c(NA, -3L))

Please bear with me. I need to summarise the data for each time point (mean and SE) and run multiple comparisons between genotypes (t-tests, i.e. WT vs aox2, WT vs aox5, aox2 vs aox5), then build a table like the one in the figure below.

My real df has more genotypes and time points, so it is difficult to handle in Excel.

How can I do this in R? Could anyone help me with this?

RNA-Seq r R statistics Proteomics • 2.8k views

ADD COMMENT • link updated 6.7 years ago by Chirag Parsania ★ 2.0k • written 6.7 years ago by WUSCHEL ▴ 860

What have you tried? Have you done any basic R tutorials? No offense, but a Google search turns up a lot of R programming tutorials, including ones covering basic functions such as calculating a mean.

ADD REPLY • link 6.7 years ago by Benn 8.4k

Also please elaborate on what you actually want to do (why do you want to calculate a t-test?) as we might know better ways of doing it :-)

ADD REPLY • link 6.7 years ago by Kristoffer Vitting-Seerup ★ 4.2k

Hi Kristoffer, it doesn't have to be a t-test. A post hoc multiple comparison would also be fine. This is the table format my supervisor prefers. It would be great if you could help me with this.

ADD REPLY • link 6.7 years ago by WUSCHEL ▴ 860

A good starting point would be to reshape your data from wide-to-long.

ADD REPLY • link 6.7 years ago by zx8754 12k

Hello BIOAWY!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/54764591/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 166k

GenoMax · Accepted Answer · 2019-02-19

Here are a few tactics for simplifying and visualising the data in R with the tidyverse. You can take this as a starting point and explore further.

library(tidyverse)

ss <- df %>%
  as_tibble() %>%
  gather(key = "cond", value = "value", -AGI) %>%  ## wide to long format
  separate(cond, into = c("Genotype", "Time", "Replicate"), sep = "_+")  ## separate each attribute


# A tibble: 108 x 5
   AGI       Genotype Time  Replicate value
   <chr>     <chr>    <chr> <chr>     <dbl>
 1 ATCG01240 aox2     0h    1          15.8
 2 ATCG01310 aox2     0h    1          14.8
 3 ATMG00070 aox2     0h    1          14.7
 4 ATCG01240 aox2     0h    2          16.1
 5 ATCG01310 aox2     0h    2          14.5
 6 ATMG00070 aox2     0h    2          14.5
 7 ATCG01240 aox2     0h    3          14.6
 8 ATCG01310 aox2     0h    3          14.7
 9 ATMG00070 aox2     0h    3          13.1
10 ATCG01240 aox2     0h    4          15.7
# … with 98 more rows
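
For readers on a recent tidyr (1.0.0 or later, where `gather()` is superseded), the same wide-to-long step can be sketched with `pivot_longer()`. This is only an illustration, not part of the original answer: `df_small` and `ss_small` are made-up names, and the data frame holds just four columns copied from the question so that the snippet runs on its own.

```r
library(tidyr)

# Four columns copied from the example data in the question
# (two replicates each of aox2 and WT at 0h); df_small is a hypothetical name.
df_small <- data.frame(
  AGI        = c("ATCG01240", "ATCG01310", "ATMG00070"),
  aox2_0h__1 = c(15.79105291, 14.82652303, 14.70630068),
  aox2_0h__2 = c(16.06494674, 14.50610036, 14.52189807),
  WT_0h__1   = c(14.0535031, 12.45484834, 14.89102226),
  WT_0h__2   = c(13.64720361, 15.07144643, 14.99836235)
)

# Reshape and split the column names in one call.
# The pattern tolerates one or more underscores between the three parts,
# mirroring sep = "_+" in the separate() call above.
ss_small <- pivot_longer(
  df_small,
  cols          = -AGI,
  names_to      = c("Genotype", "Time", "Replicate"),
  names_pattern = "^([^_]+)_+([^_]+)_+([^_]+)$",
  values_to     = "value"
)

ss_small
```

The rows come out ordered by gene rather than by sample column, but the content is the same as the `gather()` plus `separate()` result.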


## average out replicates 
ss_m <- ss %>% group_by(AGI, Genotype , Time) %>% summarise(replicates_mean = mean(value))

ss_m
# A tibble: 27 x 4
# Groups:   AGI, Genotype [?]
   AGI       Genotype Time  replicates_mean
   <chr>     <chr>    <chr>           <dbl>
 1 ATCG01240 aox2     0h               15.6
 2 ATCG01240 aox2     12h              14.9
 3 ATCG01240 aox2     24h              15.2
 4 ATCG01240 aox5     0h               15.2
 5 ATCG01240 aox5     12h              14.6
 6 ATCG01240 aox5     24h              15.3
 7 ATCG01240 WT       0h               14.2
 8 ATCG01240 WT       12h              14.8
 9 ATCG01240 WT       24h              15.3
10 ATCG01310 aox2     0h               14.8
# … with 17 more rows
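
The question asked for the standard error rather than only the mean. A minimal sketch, assuming SE here means the sample standard deviation divided by the square root of the number of replicates, could look like the following. The names `ss_small`, `ss_se`, `n_rep` and `se` are chosen for illustration, the values are copied from the question for one gene at 0h, and `.groups = "drop"` needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Long-format rows for ATCG01240 at 0h, values copied from the question.
ss_small <- data.frame(
  AGI      = "ATCG01240",
  Genotype = rep(c("aox2", "WT"), each = 4),
  Time     = "0h",
  value    = c(15.79105291, 16.06494674, 14.64596287, 15.71713641,
               14.0535031, 13.64720361, 14.28295759, 14.79637443)
)

# Mean, replicate count and standard error per gene, genotype and time point.
ss_se <- ss_small %>%
  group_by(AGI, Genotype, Time) %>%
  summarise(
    replicates_mean = mean(value),
    n_rep           = n(),
    se              = sd(value) / sqrt(n_rep),
    .groups         = "drop"
  )

ss_se
```

On the full data the same `summarise()` call should work on `ss` directly, since it has the same columns.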

Comparing time points:

bplot <- ss_m %>% ggplot() + geom_boxplot(aes(x = Time, y = replicates_mean , fill = Genotype)) +  theme_bw() + theme(text = element_text(size = 20))
ggsave(filename = "boxplot.png" ,plot = bplot)

Comparing genotypes:

bplot2 <- ss_m %>% ggplot() + geom_boxplot(aes(x = Genotype, y = replicates_mean , fill = Time)) +  theme_bw() + theme(text = element_text(size = 20))
ggsave(filename = "boxplot2.png" ,plot = bplot2)

====================== Update ======================

Converting the data into the format you asked for (mean and standard deviation only):

ss_mm <- ss %>% group_by(AGI, Genotype , Time) %>% 
        summarise(replicates_mean = mean(value) , stddev = sd(value)) %>% ## add stddev and mean 
        unite(Genotype, Time , col = "Genotype_Time" , sep = "_") %>% ## unite genotype and time in a single column
        gather(key = summary_type , value = value , replicates_mean , stddev) %>% ## create summary_type variable 
        unite(Genotype_Time, summary_type , col = "Genotype_Time_summary_type",sep = "_") %>% ##create Genotype_Time_summary_type variable
        spread(Genotype_Time_summary_type , value) ## wide format 

## summary of final table. 
glimpse(ss_mm)

Observations: 3
Variables: 19
Groups: AGI [3]
$ AGI                      <chr> "ATCG01240", "ATCG01310", "ATMG00070"
$ aox2_0h_replicates_mean  <dbl> 15.55477, 14.80490, 14.65538
$ aox2_0h_stddev           <dbl> 0.6240735, 0.2689779, 1.3299868
$ aox2_12h_replicates_mean <dbl> 14.88098, 15.01747, 14.93024
$ aox2_12h_stddev          <dbl> 0.42239203, 0.09092439, 0.60146632
$ aox2_24h_replicates_mean <dbl> 15.16378, 14.88365, 14.29683
$ aox2_24h_stddev          <dbl> 0.33059885, 0.07035039, 0.90767009
$ aox5_0h_replicates_mean  <dbl> 15.18685, 14.50503, 15.07261
$ aox5_0h_stddev           <dbl> 0.7175443, 0.7168420, 0.8547323
$ aox5_12h_replicates_mean <dbl> 14.57651, 15.05366, 14.96086
$ aox5_12h_stddev          <dbl> 0.07459644, 0.16231378, 1.28919700
$ aox5_24h_replicates_mean <dbl> 15.32162, 15.03121, 14.77809
$ aox5_24h_stddev          <dbl> 0.5483318, 0.1643006, 0.7481768
$ WT_0h_replicates_mean    <dbl> 14.19501, 14.09748, 15.02218
$ WT_0h_stddev             <dbl> 0.4794059, 1.2639163, 0.1382836
$ WT_12h_replicates_mean   <dbl> 14.84365, 14.16882, 14.69625
$ WT_12h_stddev            <dbl> 0.1521183, 0.8097963, 1.2471750
$ WT_24h_replicates_mean   <dbl> 15.32586, 14.31087, 14.92495
$ WT_24h_stddev            <dbl> 0.5509899, 0.7280381, 1.5358987
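
The table above does not cover the genotype comparisons from the question (WT vs aox2, WT vs aox5, aox2 vs aox5). One possible sketch, using base R's `pairwise.t.test()` on a single gene at a single time point, is shown below. The choice of Holm adjustment and of unpooled (Welch) tests is an assumption made for illustration, not something the original answer specifies, and with four replicates per group the p-values should be read cautiously.

```r
# Replicates for ATCG01240 at 0h, copied from the example data in the question.
value <- c(
  15.79105291, 16.06494674, 14.64596287, 15.71713641,  # aox2
  15.8245918,  15.75108628, 14.36545859, 14.80626019,  # aox5
  14.0535031,  13.64720361, 14.28295759, 14.79637443   # WT
)
genotype <- factor(rep(c("aox2", "aox5", "WT"), each = 4),
                   levels = c("WT", "aox2", "aox5"))

# All pairwise two-sample t-tests between genotypes.
# pool.sd = FALSE runs a separate Welch t-test per pair;
# p.adjust.method = "holm" corrects for the three comparisons.
pt <- pairwise.t.test(value, genotype,
                      p.adjust.method = "holm", pool.sd = FALSE)

# Lower-triangular matrix: rows aox2/aox5, columns WT/aox2.
pt$p.value
```

To fill a whole table, the same call could be applied to each gene and time point of the long-format `ss` (for example after `split()`), with the resulting p-values joined to the mean and SD columns.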