Question

How to sum up the columns to remove duplicated row names in RSEM output?

6.6 years ago

John ▴ 270

Hi,

I have an RSEM output table with 64 columns and 24833 rows. Some of the row names are duplicated, and I want to remove the duplicates by summing the duplicated rows across all 64 columns. The row names are gene names and the column names are sample names. I am new to R; can you please help me with R code for this?

> all <- read.table(file = "tpmat.xls", header = TRUE)
> dim(all)
[1] 24833    64

R RNA-Seq • 24k views

Updated 6.6 years ago by Nicolas Rosewick 11k • written 6.6 years ago by John ▴ 270

See the thread "How to sum up the duplicated value while keep the other columns?"

Try the suggestions in that thread. They should work for your case.

6.6 years ago by venu 7.1k

score 4 · Answer 1 · 2018-03-12

With dplyr, you can use group_by and summarise_all.

Here's an example:

library(dplyr)

> a
# A tibble: 7 x 4
  gene  sample1 sample2 sample3
  <chr>   <int>   <int>   <int>
1 A           1       1       1
2 B           1       1       1
3 B           1       1       1
4 C           1       1       1
5 C           1       1       1
6 C           1       1       1
7 D           1       1       1

    # funs() has been deprecated since dplyr 0.8.0, so pass sum directly
    a %>%
      group_by(gene) %>%
      summarise_all(sum)

# A tibble: 4 x 4
  gene  sample1 sample2 sample3
  <chr>   <int>   <int>   <int>
1 A           1       1       1
2 B           2       2       2
3 C           3       3       3
4 D           1       1       1
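
For anyone who wants to run the example end to end, here is a minimal sketch that builds the same toy table and collapses it. It assumes dplyr 1.0.0 or later, because it uses across() in place of summarise_all(); the object name summed is only illustrative.

```r
library(dplyr)

# Toy data mirroring the example above: gene names with duplicates.
a <- tibble(
  gene    = c("A", "B", "B", "C", "C", "C", "D"),
  sample1 = 1L,
  sample2 = 1L,
  sample3 = 1L
)

# Sum every sample column within each gene.
# across() needs dplyr >= 1.0.0 (assumption about the installed version).
summed <- a %>%
  group_by(gene) %>%
  summarise(across(everything(), sum))

print(summed)
```

Inside a grouped summarise, everything() skips the grouping column, so only the sample columns are summed.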

score 2 · Answer 2 · 2018-03-12

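Actually, duplicated row names are not allowed in the data frame that read.table returns, so the gene names are presumably stored in an ordinary column rather than as row names.

The main idea is to use dplyr::group_by, which groups the rows by the column containing the duplicated names, and dplyr::summarise_all(sum), which sums all values within each group.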

Example code:

library(dplyr)  # needed so that %>% is available

# rowname_duplicated is the column holding the duplicated gene names.
dplyr::group_by(all, rowname_duplicated) %>% dplyr::summarise_all(sum)
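
If you would rather not depend on dplyr, base R's rowsum() does the same group-wise summing. This is only a sketch on a made-up table: the data frame tpm and its column name gene are assumptions for illustration, not taken from the original file.

```r
# Hypothetical toy table standing in for the RSEM matrix; the column
# name "gene" is an assumption, not taken from the original file.
tpm <- data.frame(
  gene    = c("A", "B", "B", "C"),
  sample1 = c(1.5, 2.0, 3.0, 4.0),
  sample2 = c(0.5, 1.0, 1.0, 2.0)
)

# rowsum() adds up the rows that share the same group label and
# returns one row per unique gene, with the gene names as row names.
collapsed <- rowsum(tpm[, -1], group = tpm$gene)

print(collapsed)
```

Because the result has unique row names, it can be passed on to tools that expect a gene-by-sample matrix.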