Question

sum of values in a column

0

Entering edit mode

4.1 years ago

pt.taklifi ▴ 60

I have a list named BRCA.b.1

seqnames start   end width strand   score.x   name score.y annotation percentGC percentAT
1     chr1     1  9999  9999      *  0.000000 BRCA_1 1.59268     Distal 0.3852295 0.3592814
2     chr1 10000 10099   100      * 17.716522 BRCA_1 1.59268     Distal 0.3852295 0.3592814
3     chr1 10100 10199   100      * 30.601267 BRCA_1 1.59268     Distal 0.3852295 0.3592814
4     chr1 10200 10299   100      *  9.663558 BRCA_1 1.59268     Distal 0.3852295 0.3592814
5     chr1 10300 10399   100      *  4.831779 BRCA_1 1.59268     Distal 0.3852295 0.3592814
6     chr1 10400 10499   100      *  8.052965   <NA>      NA       <NA>        NA        NA

and I have another list with distinct values of column name called BRCA for each specific name in the column name I want to calculate the sum of scores for corresponding rows and store them in BRCA$coverage. so for the first value in column name that is BRCA_1 I would like to get 62.81313

so far I wrote this code and it works right

for ( i in 1: nrow(BRCA))
{
BRCA$coverage[i]<-  sum(BRCA.b.1[which(BRCA.b.1$name== BRCA$name[i]), 6])
}

but because BRCA.b.1 is a big list it takes a long time to run

can you suggest a more efficient way to do this ?

r • 933 views

ADD COMMENT • link 4.1 years ago by pt.taklifi ▴ 60

0

Entering edit mode

with tsv-utils:

$ cat list1.txt 

seqnames    start   end width   strand  score.x name    score.y annotation  percentGC   percentAT
chr1    1   9999    9999    *   0.000000    BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10000   10099   100 *   17.716522   BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10100   10199   100 *   30.601267   BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10200   10299   100 *   9.663558    BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10300   10399   100 *   4.831779    BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10400   10499   100 *   8.052965    <NA>    NA  <NA>    NA  NA

$ cat list2.txt 

old_name    new_name
BRCA_1  BRCA

output:

$ tsv-join -H -f list2.txt -k 1 -d 7  list1.txt -a 2 | tsv-summarize -H --sum score.x:BRCA_average

BRCA_average
62.813126

ADD REPLY • link 4.0 years ago by cpad0112 21k

score 2 · Accepted Answer · 2021-01-28

2

Entering edit mode

4.1 years ago

Ram 44k

This would benefit from dplyr, although there's definitely a base R solution.

library(dplyr)
BRCA.b.1 %>% filter(name %in% BRCA$name) %>% group_by(name) %>% summarise(total_score = sum(score), .groups = 'keep')

ADD COMMENT • link 4.1 years ago by Ram 44k

0

Entering edit mode

thank you @_r_am for your response I tried the command you mentioned but I get the following error

Error: Problem with `summarise()` input `total_score`.
x invalid 'type' (closure) of argument
ℹ Input `total_score` is `sum(score)`.
ℹ The error occurred in group 1: name = "BRCA_1".
Run `rlang::last_error()` to see where the error occurred.

ADD REPLY • link 4.1 years ago by pt.taklifi ▴ 60

2

Entering edit mode

Usually it occurs when the dataframe passed to summarise() has no column named 'score' but there is a 'score' function available in your environment. In this case code does not throws an error like "that there is no such column" and dies trying to compute sum of a function 'score' instead. I conclude from your example above that it should really be sum(score.x).