Question

How do i calculate the mean of triplicates in a data.frame based on pattern?

1

Entering edit mode

10 months ago

Assa Yeroslaviz ★ 1.9k

I have a data.frame with six samples in two triplicates. I would like to calculates for each group the mean of the this group into a new column. What is the best way to do this? My data looks like that

head(p767)
                 sgrna p767.AM_1 p767.AM_2 p767.AM_3 p767.MM_1 p767.MM_2 p767.MM_3
1  Control-104-KO-64-F       758       682       713      1867      1840      1251
2  Control-110-KO-29-F       674       561       642      1706      1737      1141
3  Control-112-KO-65-F       584       518       495      1328      1448      1012
4  Control-125-KO-31-F       565       462       569      1345      1258       856
5 Control-128-KO-136-F       463       452       424      1253      1258       721
6  Control-132-KO-57-F       854       698       755      1974      1977      1326

I would like to a new data.frame with the mean values calculated as such:

head(p767.m)
                     p767.AM p767.MM
Control-104-KO-64-F  717.6667 1652.667
Control-110-KO-29-F  625.6667 1528.000
Control-112-KO-65-F  532.3333 1262.667
Control-125-KO-31-F  532.0000 1153.000
Control-128-KO-136-F 446.3333 1077.333
Control-132-KO-57-F  769.0000 1759.000

Below is the way to read it into R. I would appreciate the help.

Thanks

structure(list(sgrna = c("Control-104-KO-64-F", "Control-110-KO-29-F", 
"Control-112-KO-65-F", "Control-125-KO-31-F", "Control-128-KO-136-F", 
"Control-132-KO-57-F"), p767.AM_1 = c(758L, 674L, 584L, 565L, 
463L, 854L), p767.AM_2 = c(682L, 561L, 518L, 462L, 452L, 698L
), p767.AM_3 = c(713L, 642L, 495L, 569L, 424L, 755L), p767.MM_1 = c(1867L, 
1706L, 1328L, 1345L, 1253L, 1974L), p767.MM_2 = c(1840L, 1737L, 
1448L, 1258L, 1258L, 1977L), p767.MM_3 = c(1251L, 1141L, 1012L, 
856L, 721L, 1326L)), row.names = c(NA, 6L), class = "data.frame")

tidyverse r • 1.6k views

ADD COMMENT • link updated 10 months ago by gglim ▴ 220 • written 10 months ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

I know it can be done with something like that:

cbind(p767.AM = (rowMeans(p767[,2:4])),
                p767.MM = (rowMeans(p767[,5:7])))

but it would be nice to have a tidyer way of doing it.

ADD REPLY • link 10 months ago by Assa Yeroslaviz ★ 1.9k

1

Entering edit mode

10 months ago

gglim ▴ 220

library(dplyr)
df %>% rowwise() %>% mutate(p767.AM = mean(c_across(contains("AM"))))

Maybe this is what you're looking for?

ADD COMMENT • link 10 months ago by gglim ▴ 220

0

Entering edit mode

I guess the OP doesn't want to hard-code p767.AM and "AM"?

ADD REPLY • link 10 months ago by dariober 15k

0

Entering edit mode

I guess I can live with that. :-)

Can one create a vectors of patterns, e.g. c("AM, "MM") and use it as input to the mutate command?

ADD REPLY • link 10 months ago by Assa Yeroslaviz ★ 1.9k

1

Entering edit mode

meanofcol <- function(df, col) {
  df %>% rowwise() %>% mutate("Mean of {{col}}" := mean(c_across(contains({{col}}))))
}
meanofcol(df, col = "AM")

This seems good for me, reference here

ADD REPLY • link 10 months ago by gglim ▴ 220

0

Entering edit mode

10 months ago

MolGeek ▴ 80

Or just use rowMeans and provide the desired columns (for dataframes with low number of columns):

p767.m <- data.frame(p767.AM = rowMeans(p767[,1:3]) , p767.MM = rowMeans(p767[,4:6])   )

ADD COMMENT • link 10 months ago by MolGeek ▴ 80

1

Entering edit mode

Works here, but it's not generic or tidy since you have to manually enter the correct numbers and enter the colnames of the output.

ADD REPLY • link 10 months ago by ATpoint 87k

0

Entering edit mode

You are right, thats a quick and dirty answer to get the means.

ADD REPLY • link 10 months ago by MolGeek ▴ 80

0

Entering edit mode

exactly, this would be needed to be monitored closely, as the structure changes often.

ADD REPLY • link 10 months ago by Assa Yeroslaviz ★ 1.9k

score 4 · Accepted Answer · 2024-05-27

4

Entering edit mode

10 months ago

dariober 15k

I'm a happy user of data.table:

library(data.table)

dat <- structure(list(sgrna = c("Control-104-KO-64-F", "Control-110-KO-29-F", 
"Control-112-KO-65-F", "Control-125-KO-31-F", "Control-128-KO-136-F", 
"Control-132-KO-57-F"), p767.AM_1 = c(758L, 674L, 584L, 565L, 
463L, 854L), p767.AM_2 = c(682L, 561L, 518L, 462L, 452L, 698L
), p767.AM_3 = c(713L, 642L, 495L, 569L, 424L, 755L), p767.MM_1 = c(1867L, 
1706L, 1328L, 1345L, 1253L, 1974L), p767.MM_2 = c(1840L, 1737L, 
1448L, 1258L, 1258L, 1977L), p767.MM_3 = c(1251L, 1141L, 1012L, 
856L, 721L, 1326L)), row.names = c(NA, 6L), class = "data.frame")

setDT(dat)

ldat <- melt(dat, id.vars='sgrna', variable.name='library_id', value.name='count')
ldat[, sample_id := sub('_\\d+$', '', library_id)]

# Averages in long format:
cntAvg <- ldat[, list(avg=mean(count)), by=list(sgrna, sample_id)]

# If you want to get back to wide format
dcast(cntAvg, sgrna ~ sample_id)

A couple of notes:

It is almost always better to work with data in long format (or tidy or normalised)
I would prefer to get the sample names from a sample sheet rather than parsing the library names with some dodgy regex.

ADD COMMENT • link 10 months ago by dariober 15k

1

Entering edit mode

I would prefer to get the sample names from a sample sheet rather than parsing the library names with some dodgy regex.

+1 !!!

Anecdote: I analyzed data last week where they indeed tried to put metadata into the filename, resulting in a 160 character string that needed deparsing (demessification). Please, people, unless it's a really short and simple encoding, please use sample/annotation sheets.

ADD REPLY • link 10 months ago by ATpoint 87k

0

Entering edit mode

I would prefer to get the sample names from a sample sheet rather than parsing the library names with some dodgy regex.

Not really sure, what you mean here. I read the whole count matrix from a data file. Do you mean a separate file with a list of sample names( one name per row)?

ADD REPLY • link 10 months ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Thanks to your help I managed to do it also in tidyverse, by using first pivot_longer creating a new column with group, calculating the mean by grouping the sgrna and group. Then using pivot_wider to convert it to the format i need removing all unnecessary rows.

melted_df <- pivot_longer(dat, cols = starts_with("p"), names_to = "project", values_to = "counts") |>
  mutate(group = gsub(pattern = "\\_[1|2|3]", replacement = "", x = project)) |>
  group_by(sgrna, group) |> mutate(avg = mean(counts)) |> 
  distinct(sgrna,group,avg) |> 
  pivot_wider(id_cols = sgrna, names_from = group, values_from = avg) |> as.data.frame()

thanks again

Still not sure, how you meant it, when you mentioned reading the sample names from a text file.

ADD REPLY • link 10 months ago by Assa Yeroslaviz ★ 1.9k