Question

Problem with gather() in R

1

7 months ago

egascon ▴ 60

Hi everyone,

I want to ask you about the gather() function. I have a data frame of 23349 obs. of 560 variables. It looks like this:

[screenshot of the data frame, not reproduced here]

These are microarray data. We have their respective SYMBOL identifier and expression values for 559 samples.

My goal is to change the sample names. For example: GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz to GSM2630758. That is, I want to remove everything from the first _ onward and keep only the GSM IDs.

To do this, I have written the following code:

data <- data %>% 
  gather(key = 'samples', value = 'counts', -SYMBOL) %>% 
  mutate(samples = gsub("_.*", "", samples)) %>% 
  spread(key = 'samples', value = 'counts') %>% 
  column_to_rownames(var = 'SYMBOL')

But when I run it step by step, the gather() step gives the following:

Code:

data.mod.uniq %>% 
  gather(key = 'samples', value = 'counts', -SYMBOL) %>% 
  head()

Result:

    SYMBOL                                      samples           counts
1    DDR1  GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 5.40512912518574
2     RFC2 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 5.20077702172573
3    HSPA6 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 8.32865256239474
4     PAX8 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 7.01433492839524
5   GUCA1A GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 2.45719918379693
6 MIR5193  GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 7.74825598612434

The sample names in the key column all appear to be copies of the first one, and I don't understand why, since the original column names are all different. How can I solve it? There are no duplicate genes or duplicate samples; I removed them beforehand.

Thank you for your help,

gather r • 777 views

ADD COMMENT • link 7 months ago by egascon ▴ 60

3

I don't think anything is duplicated here. You are changing your data frame from wide to long format, meaning that each row now represents the count for one gene-sample combination. head() shows only the first six rows, and those all belong to the first sample. You can check the dimensions of the resulting data frame after gather().
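
A minimal sketch of that check, assuming tidyr is installed. The data frame `wide` and its column names are made up for illustration and are not the poster's data:

```r
library(tidyr)

# Toy stand-in for the real data: 3 genes x 2 samples (values are invented)
wide <- data.frame(
  SYMBOL = c("DDR1", "RFC2", "HSPA6"),
  GSM1_a.CEL.gz = c(5.4, 5.2, 8.3),
  GSM2_b.CEL.gz = c(4.9, 5.2, 8.5)
)

long <- gather(wide, key = "samples", value = "counts", -SYMBOL)

dim(long)             # one row per gene-sample pair: 3 genes * 2 samples
head(long, 3)         # rows are grouped by sample, so only the first sample shows
unique(long$samples)  # every sample name is still present in the key column
```

With the real data, the same reasoning suggests 23349 * 559 rows after gather(), so head() can only ever show the first sample.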

And regarding your question of changing the sample names, you can just use sub() or gsub() on the names of the original data frame, no need to gather if this is your only goal.

ADD REPLY • link 7 months ago by Haci ▴ 730

0

Hi!

Thank you for all the options. I have learned a lot from these solutions. Thanks for the book as well.

In the end it is as Haci says: nothing was duplicated and everything was renamed without a problem. But I will keep the other options; anything that helps me learn to manipulate data with simpler code is welcome.

Thank you again

ADD REPLY • link 7 months ago by egascon ▴ 60

score 3 · Answer 1 · 2024-07-10

3

7 months ago

zx8754 12k

Looks like we just want to rename the column names, drop anything after "_", using base R:

colnames(df1) <- gsub("_.*", "", colnames(df1))

#check
# colnames(df1)
# [1] "SYMBOL"     "GSM2630758" "GSM2630759" "GSM2630760"
#
#check duplicated
# sum(duplicated(colnames(df1)))
# [1] 0
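
A slightly more defensive variant can be sketched as follows, assuming every sample column starts with a GSM accession (GSM followed by digits and an underscore). The vector `old_names` is a hypothetical stand-in for `colnames(df1)`:

```r
# Hypothetical column names in the style of the question
old_names <- c("SYMBOL",
               "GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz",
               "GSM2630759_E7R_039a02_HG-U133_Plus_2_.CEL.gz")

# Only touch names that look like a GSM accession followed by "_"
is_sample <- grepl("^GSM[0-9]+_", old_names)
new_names <- old_names
new_names[is_sample] <- sub("_.*", "", old_names[is_sample])

# Stop before renaming if truncation would make two columns collide
stopifnot(anyDuplicated(new_names) == 0)
new_names
```

If the check passes, `colnames(df1) <- new_names` applies the result. Restricting the substitution to GSM columns leaves annotation columns untouched even if they happen to contain an underscore.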

ADD COMMENT • link 7 months ago by zx8754 12k

Ram · Answer 2 · 2024-07-10

2

7 months ago

BioinfGuru ★ 2.1k

I think you've just over-complicated it all because you haven't studied tidyverse enough. We've all been there. I cannot recommend the book R for Data Science highly enough; working through it will make a huge difference.

If you are just trying to tidy the column names then just use dplyr::rename_with()

# Load libraries
library(dplyr)

# create sample data
# check.names = FALSE keeps the "-" in the column names; by default
# data.frame() would convert it to "."
df1 <- data.frame(
  SYMBOL = c("DDR1", "RFC2", "HSPA6"),
  "GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz" = c(5.4, 5.2, 8.3),
  "GSM2630759_E7R_039a02_HG-U133_Plus_2_.CEL.gz" = c(4.9, 5.2, 8.5),
  "GSM2630760_E7R_039a03_HG-U133_Plus_2_.CEL.gz" = c(5.3, 4.9, 8.6),
  check.names = FALSE
)
df1

# tidy col names
df1 |> 
  rename_with(~ gsub("_.*", "", .x))

# RETURNS:

  SYMBOL GSM2630758 GSM2630759 GSM2630760
1   DDR1        5.4        4.9        5.3
2   RFC2        5.2        5.2        4.9
3  HSPA6        8.3        8.5        8.6
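
rename_with() also takes a `.cols` argument, so the renaming can be limited to the sample columns. A small sketch, assuming dplyr is installed and that all sample columns start with "GSM"; `df_demo` is a hypothetical cut-down copy of the sample data above:

```r
library(dplyr)

# Hypothetical two-sample example in the style of the question
df_demo <- data.frame(
  SYMBOL = c("DDR1", "RFC2"),
  "GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz" = c(5.4, 5.2),
  "GSM2630759_E7R_039a02_HG-U133_Plus_2_.CEL.gz" = c(4.9, 5.2),
  check.names = FALSE
)

# Apply the substitution only to columns whose names start with "GSM"
renamed <- df_demo |>
  rename_with(~ sub("_.*", "", .x), .cols = starts_with("GSM"))

colnames(renamed)
```

This leaves SYMBOL and any other annotation columns alone regardless of what characters their names contain.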

ADD COMMENT • link updated 7 months ago by Ram 44k • written 7 months ago by BioinfGuru ★ 2.1k

4

While this is a good tidyverse solution, the base R solution is much more straightforward in my opinion. "Change column names" is the concept here and that rename_with with the .x is something that only people steeped in tidyverse can think about off hand. I've been using tidyverse for a decade now and could not have thought of that syntax no matter how hard I tried. Sometimes tidy can over-complicate things. Use it to manipulate data, not the structures that hold the data.