How to create a summary statistics data table for omics data?
5.9 years ago
WUSCHEL ▴ 810

Hi, I have a large data frame of omics data. Samples are named Genotype_Time_Replicate (e.g. AOX_1h_4). Each genotype has 4 replicates at each time point.

E.g. data set

df <- structure(list(AGI = c("ATCG01240", "ATCG01310", "ATMG00070"), aox2_0h__1 = c(15.79105291, 14.82652303, 14.70630068), aox2_0h__2 = c(16.06494674, 14.50610036, 14.52189807), aox2_0h__3 = c(14.64596287, 14.73266459, 13.07143141), aox2_0h__4 = c(15.71713641, 15.15430026, 16.32190068 ), aox2_12h__1 = c(14.99030606, 15.08046949, 15.8317372), aox2_12h__2 = c(15.15569857, 14.98996474, 14.64862254), aox2_12h__3 = c(15.12144791, 14.90111092, 14.59618842), aox2_12h__4 = c(14.25648197, 15.09832061, 14.64442686), aox2_24h__1 = c(15.23997241, 14.80968391, 14.22573239 ), aox2_24h__2 = c(15.57551513, 14.94861669, 15.18808897), aox2_24h__3 = c(15.04928714, 14.83758685, 13.06948037), aox2_24h__4 = c(14.79035385, 14.93873234, 14.70402827), aox5_0h__1 = c(15.8245918, 14.9351844, 14.67678306), aox5_0h__2 = c(15.75108628, 14.85867002, 14.45704948 ), aox5_0h__3 = c(14.36545859, 14.79296855, 14.82177912), aox5_0h__4 = c(14.80626019, 13.43330964, 16.33482718), aox5_12h__1 = c(14.66327372, 15.22571466, 16.17761867), aox5_12h__2 = c(14.58089039, 14.98545497, 14.4331578), aox5_12h__3 = c(14.58091828, 14.86139511, 15.83898617 ), aox5_12h__4 = c(14.48097297, 15.1420725, 13.39369381), aox5_24h__1 = c(15.41855602, 14.9890092, 13.92629626), aox5_24h__2 = c(15.78386057, 15.19372889, 14.63254456), aox5_24h__3 = c(15.55321382, 14.82013321, 15.74324956), aox5_24h__4 = c(14.53085803, 15.12196994, 14.81028556 ), WT_0h__1 = c(14.0535031, 12.45484834, 14.89102226), WT_0h__2 = c(13.64720361, 15.07144643, 14.99836235), WT_0h__3 = c(14.28295759, 13.75283646, 14.98220861), WT_0h__4 = c(14.79637443, 15.1108037, 15.21711524 ), WT_12h__1 = c(15.05711898, 13.33689777, 14.81064042), WT_12h__2 = c(14.83846779, 13.62497318, 14.76356308), WT_12h__3 = c(14.77215863, 14.72814995, 13.0835214), WT_12h__4 = c(14.70685445, 14.98527337, 16.12727292), WT_24h__1 = c(15.43813077, 14.56918572, 14.92146565 ), WT_24h__2 = c(16.05986898, 14.70583866, 15.64566505), WT_24h__3 = c(14.87721853, 13.22461859, 16.34119942), 
WT_24h__4 = c(14.92822133, 14.74382383, 12.79146694)), class = "data.frame", row.names = c(NA, -3L))

Please bear with me. I need to summarize the data for each time point (mean and SE) and run pairwise comparisons between genotypes (t-test, i.e. WT vs. aox2, WT vs. aox5, aox2 vs. aox5), then create a table like the one in the figure below.

[Image: Picture1]

My real df has more genotypes and time points, so it is difficult to work with in Excel.

How can I do this in R? Could anyone help me with this?

RNA-Seq r R statistics Proteomics • 2.1k views

What have you tried? Have you worked through any basic R tutorials? No offense, but a quick Google search turns up plenty of R programming tutorials that cover basic functions such as calculating means.


Also please elaborate on what you actually want to do (why do you want to calculate a t-test?) as we might know better ways of doing it :-)


Hi Kristoffer, it doesn't have to be a t-test. A post-hoc multiple comparison would also be fine. This is the table format my supervisor prefers. It would be great if you could help me with this.


A good starting point would be to reshape your data from wide-to-long.
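
As a rough sketch of that reshaping step, assuming the tidyr package is available: `pivot_longer()` can melt the sample columns and split their names in one call. The `df_small` object below is only a small subset of the example data, made up here so the snippet runs on its own.

```r
library(tidyr)

# Small subset of the example data from the question (two samples, three genes)
df_small <- data.frame(
  AGI        = c("ATCG01240", "ATCG01310", "ATMG00070"),
  aox2_0h__1 = c(15.79105291, 14.82652303, 14.70630068),
  WT_0h__1   = c(14.0535031, 12.45484834, 14.89102226)
)

# Wide to long: every column except AGI becomes one row per gene and sample.
# names_sep is a regular expression; "_+" matches one or more underscores,
# which handles the double underscore in names such as aox2_0h__1.
long <- pivot_longer(
  df_small,
  cols      = -AGI,
  names_to  = c("Genotype", "Time", "Replicate"),
  names_sep = "_+",
  values_to = "value"
)

long
```

With the data in this shape, per-group summaries become a matter of grouping by AGI, Genotype and Time.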


Hello BIOAWY!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/54764591/

This is typically not recommended as it runs the risk of annoying people in both communities.

5.9 years ago
Chirag Parsania ★ 2.0k

Here are a few ways to reshape, summarise and visualise the data in R using the tidyverse. You can take this as a starting point and explore further.

library(tidyverse)
ss <- df %>% 
  as_tibble() %>% 
  gather(key = "cond", value = "value", -AGI) %>%  ## wide to long format
  separate(cond, into = c("Genotype", "Time", "Replicate"), sep = "_+") ## split sample names; "_+" also matches the double underscore before the replicate number

ss


# A tibble: 108 x 5
   AGI       Genotype Time  Replicate value
   <chr>     <chr>    <chr> <chr>     <dbl>
 1 ATCG01240 aox2     0h    1          15.8
 2 ATCG01310 aox2     0h    1          14.8
 3 ATMG00070 aox2     0h    1          14.7
 4 ATCG01240 aox2     0h    2          16.1
 5 ATCG01310 aox2     0h    2          14.5
 6 ATMG00070 aox2     0h    2          14.5
 7 ATCG01240 aox2     0h    3          14.6
 8 ATCG01310 aox2     0h    3          14.7
 9 ATMG00070 aox2     0h    3          13.1
10 ATCG01240 aox2     0h    4          15.7
# … with 98 more rows


## average out replicates 
ss_m <- ss %>% group_by(AGI, Genotype , Time) %>% summarise(replicates_mean = mean(value))

ss_m
# A tibble: 27 x 4
# Groups:   AGI, Genotype [?]
   AGI       Genotype Time  replicates_mean
   <chr>     <chr>    <chr>           <dbl>
 1 ATCG01240 aox2     0h               15.6
 2 ATCG01240 aox2     12h              14.9
 3 ATCG01240 aox2     24h              15.2
 4 ATCG01240 aox5     0h               15.2
 5 ATCG01240 aox5     12h              14.6
 6 ATCG01240 aox5     24h              15.3
 7 ATCG01240 WT       0h               14.2
 8 ATCG01240 WT       12h              14.8
 9 ATCG01240 WT       24h              15.3
10 ATCG01310 aox2     0h               14.8
# … with 17 more rows

Comparing time points:

bplot <- ss_m %>% ggplot() + geom_boxplot(aes(x = Time, y = replicates_mean , fill = Genotype)) +  theme_bw() + theme(text = element_text(size = 20))
ggsave(filename = "boxplot.png" ,plot = bplot)

[Image: boxplot]

Comparing genotypes:

bplot2 <- ss_m %>% ggplot() + geom_boxplot(aes(x = Genotype, y = replicates_mean , fill = Time)) +  theme_bw() + theme(text = element_text(size = 20))
ggsave(filename = "boxplot2.png" ,plot = bplot2)

[Image: boxplot2]

======================Update ===============================

Convert the data into the format you asked for (standard deviation and mean only).

ss_mm <- ss %>% group_by(AGI, Genotype , Time) %>% 
        summarise(replicates_mean = mean(value) , stddev = sd(value)) %>% ## add stddev and mean 
        unite(Genotype, Time , col = "Genotype_Time" , sep = "_") %>% ## unite genotype and time in a single column
        gather(key = summary_type , value = value , replicates_mean , stddev) %>% ## create summary_type variable 
        unite(Genotype_Time, summary_type , col = "Genotype_Time_summary_type",sep = "_") %>% ##create Genotype_Time_summary_type variable
        spread(Genotype_Time_summary_type , value) ## wide format 

## summary of final table. 
glimpse(ss_mm)

Observations: 3
Variables: 19
Groups: AGI [3]
$ AGI                      <chr> "ATCG01240", "ATCG01310", "ATMG00070"
$ aox2_0h_replicates_mean  <dbl> 15.55477, 14.80490, 14.65538
$ aox2_0h_stddev           <dbl> 0.6240735, 0.2689779, 1.3299868
$ aox2_12h_replicates_mean <dbl> 14.88098, 15.01747, 14.93024
$ aox2_12h_stddev          <dbl> 0.42239203, 0.09092439, 0.60146632
$ aox2_24h_replicates_mean <dbl> 15.16378, 14.88365, 14.29683
$ aox2_24h_stddev          <dbl> 0.33059885, 0.07035039, 0.90767009
$ aox5_0h_replicates_mean  <dbl> 15.18685, 14.50503, 15.07261
$ aox5_0h_stddev           <dbl> 0.7175443, 0.7168420, 0.8547323
$ aox5_12h_replicates_mean <dbl> 14.57651, 15.05366, 14.96086
$ aox5_12h_stddev          <dbl> 0.07459644, 0.16231378, 1.28919700
$ aox5_24h_replicates_mean <dbl> 15.32162, 15.03121, 14.77809
$ aox5_24h_stddev          <dbl> 0.5483318, 0.1643006, 0.7481768
$ WT_0h_replicates_mean    <dbl> 14.19501, 14.09748, 15.02218
$ WT_0h_stddev             <dbl> 0.4794059, 1.2639163, 0.1382836
$ WT_12h_replicates_mean   <dbl> 14.84365, 14.16882, 14.69625
$ WT_12h_stddev            <dbl> 0.1521183, 0.8097963, 1.2471750
$ WT_24h_replicates_mean   <dbl> 15.32586, 14.31087, 14.92495
$ WT_24h_stddev            <dbl> 0.5509899, 0.7280381, 1.5358987
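
The question asked for the standard error rather than the standard deviation. A minimal base R sketch of that, assuming SE here means the standard error of the mean, sd / sqrt(n), is given below. The `long` data frame and the `se()` helper are made up for illustration; the values are the ATCG01240 0h replicates from the question. In a dplyr pipeline the equivalent expression inside `summarise()` would be `sd(value) / sqrt(n())`.

```r
# Toy long-format data: one gene, one time point, two genotypes, four replicates
# (values copied from the question's example for ATCG01240 at 0h)
long <- data.frame(
  AGI      = "ATCG01240",
  Genotype = rep(c("aox2", "WT"), each = 4),
  Time     = "0h",
  value    = c(15.79105291, 16.06494674, 14.64596287, 15.71713641,
               14.0535031, 13.64720361, 14.28295759, 14.79637443)
)

# Hypothetical helper: standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

# Mean and SE per gene, genotype and time point
summ <- aggregate(value ~ AGI + Genotype + Time, data = long,
                  FUN = function(x) c(mean = mean(x), se = se(x)))

# aggregate() returns the two statistics as a matrix column;
# flatten it into ordinary columns named value.mean and value.se
summ <- do.call(data.frame, summ)

summ
```

With four replicates per group the SE is simply half the standard deviation, so the `stddev` columns above can also be converted directly.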

Thank you Chirag. I greatly appreciate your help. This will be really helpful for me.

However, my supervisor needs a table like the one I illustrated, with multiple-comparison p-values. Could you guide me with this? Thanks again.


I'm sure you will be able to work it out from the code I posted. For more on p-values, error bars and other statistics, refer to this.
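
For the p-value part, one possible sketch in base R uses `pairwise.t.test()` from the stats package, run once per gene and time point. The toy `long` data frame is built from the ATCG01240 0h values in the question. The Holm adjustment and the unpooled (Welch) tests are assumptions chosen for illustration, not something the thread prescribes.

```r
# Toy long-format data: one gene (ATCG01240) at one time point (0h),
# values copied from the example data in the question
long <- data.frame(
  AGI      = "ATCG01240",
  Time     = "0h",
  Genotype = rep(c("WT", "aox2", "aox5"), each = 4),
  value    = c(14.0535031, 13.64720361, 14.28295759, 14.79637443,
               15.79105291, 16.06494674, 14.64596287, 15.71713641,
               15.8245918, 15.75108628, 14.36545859, 14.80626019)
)

# Fix the level order so the layout of the p-value matrix does not depend on locale
long$Genotype <- factor(long$Genotype, levels = c("WT", "aox2", "aox5"))

# One set of pairwise tests per gene x time point
groups <- split(long, list(long$AGI, long$Time), drop = TRUE)
pvals <- lapply(groups, function(d) {
  pairwise.t.test(d$value, d$Genotype,
                  pool.sd = FALSE,          # separate (Welch) t-test for each pair
                  p.adjust.method = "holm"  # assumption: any p.adjust() method could be used
  )$p.value
})

pvals[["ATCG01240.0h"]]  # 2 x 2 lower-triangular matrix of adjusted p-values
```

Note that the adjustment here only covers the three genotype pairs within a single gene and time point. With many genes, a further correction across genes would usually be needed.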


Check my updates in the answer.


Thank you Chirag :)


Hello Wuschel,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they work.
