Problem with gather() in R
4 months ago
egascon ▴ 60

Hi everyone,

I want to ask you about the gather() function. I have a data frame of 23349 obs. of 560 variables. It looks like this:

[Screenshot of the data frame; image not available]

These are microarray data. Each row has a gene SYMBOL identifier followed by expression values for 559 samples.

My goal is to change the sample names, for example from GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz to GSM2630758. That is, I want to remove everything from the first _ onward and keep only the GSM IDs.

To do this, I have written the following code:

data <- data %>% 
  gather(key = 'samples', value = 'counts', -SYMBOL) %>% 
  mutate(samples = gsub("_.*", "", samples)) %>% 
  spread(key = 'samples', value = 'counts') %>% 
  column_to_rownames(var = 'SYMBOL')

But when I run it step by step to check each stage, I get the following at the gather() step:

Code:

data.mod.uniq %>% 
  gather(key = 'samples', value = 'counts', -SYMBOL) %>% 
  head()

Result:

    SYMBOL                                      samples           counts
1    DDR1  GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 5.40512912518574
2     RFC2 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 5.20077702172573
3    HSPA6 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 8.32865256239474
4     PAX8 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 7.01433492839524
5   GUCA1A GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 2.45719918379693
6 MIR5193  GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 7.74825598612434

The values in the samples column all repeat the first sample name, and I don't understand why, since the sample names are all different. How can I solve this? There are no duplicate genes or duplicate samples; I have already removed them.

Thank you for your help,


I don't think anything is duplicated here. You are changing your data frame from wide to long format, so each row now represents the count for one gene-sample combination. head() only shows the first six rows, and those all belong to the first sample. You can check the dimensions of the resulting data frame after gather().

As for changing the sample names, you can just use sub() or gsub() on the names of the original data frame. There is no need to gather if this is your only goal.
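
To make the wide-to-long point concrete, here is a minimal sketch on a toy data frame. The object names df_wide and df_long and the values are made up for illustration and are not the OP's data. It checks that gather() produces one row per gene-sample pair, and that the first rows all carry the first sample name simply because the output is stacked sample by sample.

```r
library(tidyr)

# Toy wide data: 3 genes x 2 samples (hypothetical values)
df_wide <- data.frame(
  SYMBOL = c("DDR1", "RFC2", "HSPA6"),
  GSM2630758_E7R_039a01 = c(5.4, 5.2, 8.3),
  GSM2630759_E7R_039a02 = c(4.9, 5.2, 8.5)
)

# Wide -> long: every column except SYMBOL is stacked into key/value pairs
df_long <- gather(df_wide, key = "samples", value = "counts", -SYMBOL)

dim(df_wide)             # genes x (1 + samples)
dim(df_long)             # (genes * samples) x 3
unique(df_long$samples)  # every sample name is still present
head(df_long, 3)         # the first rows all belong to the first sample
```

On the real data the same check would be nrow() of the gathered result against 23349 * 559, assuming the dimensions stated in the question.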


Hi!

Thank you for all the options. I have learned a lot from these solutions. Thanks for the book as well.

In the end it is as Haci says: nothing was duplicated and everything was changed without a problem. I will keep the other options too, since anything that helps me write simpler code for manipulating data is welcome.

Thank you again

4 months ago
zx8754 12k

It looks like we just want to rename the columns, dropping everything from the first "_" onward. Using base R:

colnames(df1) <- gsub("_.*", "", colnames(df1))

# check the new names
# colnames(df1)
# [1] "SYMBOL"     "GSM2630758" "GSM2630759" "GSM2630760"
#
# check for duplicated names
# sum(duplicated(colnames(df1)))
# [1] 0
4 months ago
BioinfGuru ★ 2.1k

I think you've just over-complicated it all because you haven't studied tidyverse enough... we've all been there. I cannot recommend highly enough going through the book R for Data Science. It will make a huge difference.

If you are just trying to tidy the column names, then just use dplyr::rename_with():

# Load libraries
library(dplyr)

# create sample data
df1 <- data.frame(
  SYMBOL = c("DDR1", "RFC2", "HSPA6"),
  "GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz" = c(5.4, 5.2, 8.3),
  "GSM2630759_E7R_039a02_HG-U133_Plus_2_.CEL.gz" = c(4.9, 5.2, 8.5),
  "GSM2630760_E7R_039a03_HG-U133_Plus_2_.CEL.gz" = c(5.3, 4.9, 8.6),
  check.names = FALSE  # keep the "-" in the names; the default would turn it into "."
)
df1

# tidy col names
df1 |> 
  rename_with(~ gsub("_.*", "", .x))

# RETURNS:

  SYMBOL GSM2630758 GSM2630759 GSM2630760
1   DDR1        5.4        4.9        5.3
2   RFC2        5.2        5.2        4.9
3  HSPA6        8.3        8.5        8.6

While this is a good tidyverse solution, the base R solution is much more straightforward in my opinion. "Change column names" is the concept here, and rename_with() with the .x placeholder is something that only people steeped in tidyverse would think of offhand. I've been using tidyverse for a decade now and could not have come up with that syntax no matter how hard I tried. Sometimes tidy can over-complicate things. Use it to manipulate data, not the structures that hold the data.


Definitely a fair point. I went with tidyverse as that's what OP was using.
