Hi everyone,
I want to ask you about the gather() function. I have a data frame of 23349 obs. of 560 variables. It looks like this:
These are microarray data. Each row has a gene SYMBOL identifier and expression values for 559 samples.
My goal is to change the sample names, for example from GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz to GSM2630758. That is, I want to remove all the text from the first _ onward to keep the GSM IDs.
To do this, I have written the following code:
data <- data %>%
  gather(key = 'samples', value = 'counts', -SYMBOL) %>%
  mutate(samples = gsub("_.*", "", samples)) %>%
  spread(key = 'samples', value = 'counts') %>%
  column_to_rownames(var = 'SYMBOL')
But when I run it step by step, I get the following at the gather() step:
Code:
data.mod.uniq %>%
  gather(key = 'samples', value = 'counts', -SYMBOL) %>%
  head()
Result:
   SYMBOL                                      samples           counts
1    DDR1 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 5.40512912518574
2    RFC2 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 5.20077702172573
3   HSPA6 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 8.32865256239474
4    PAX8 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 7.01433492839524
5  GUCA1A GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 2.45719918379693
6 MIR5193 GSM2630758_E7R_039a01_HG-U133_Plus_2_.CEL.gz 7.74825598612434
Every row in the samples column shows the name of the first sample, and I don't understand why, since the sample names are all different. How can I solve this? There are no duplicate genes or duplicate samples; I removed them beforehand.
Thank you for your help,
I don't think anything is duplicated here. You are changing your data frame from wide to long format, meaning that each row represents the count of a given gene-sample combination. head() only shows the first six rows, and those all belong to the first sample. You can check the dimensions of the resulting data frame after gather().
And regarding your question of changing the sample names, you can just use sub() or gsub() on the names of the original data frame; there is no need to gather if this is your only goal.
Hi!
Thank you for all the options. I have learned a lot from these solutions. Thanks for the book as well.
In the end it is as Haci says: nothing was duplicated, and everything was renamed without a problem. I will keep the other options too, though; anything that helps me learn simpler ways to manipulate data is welcome.
Thank you again
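For anyone who finds this thread later, here is a minimal sketch of the two points made above: checking that gather() did not duplicate anything, and renaming the columns directly with sub(). The toy data frame and its two sample names are made up for illustration (they are not from the real dataset), and the sketch assumes the tidyr package is installed.

```r
library(tidyr)

# Toy stand-in for the real table: 3 genes x 2 samples.
# Sample names and values are hypothetical, chosen only to mimic the pattern.
toy <- data.frame(
  SYMBOL = c("DDR1", "RFC2", "HSPA6"),
  "GSM0000001_A_HG-U133_Plus_2_.CEL.gz" = c(5.4, 5.2, 8.3),
  "GSM0000002_B_HG-U133_Plus_2_.CEL.gz" = c(6.1, 4.9, 7.7),
  check.names = FALSE  # keep the original column names untouched
)

# Wide to long: one row per gene-sample pair, stacked sample by sample,
# so the first rows all carry the first sample's name.
long <- gather(toy, key = "samples", value = "counts", -SYMBOL)
dim(long)                     # rows = genes x samples, 3 columns
length(unique(long$samples))  # number of distinct samples in the key column

# Rename the columns directly: drop everything from the first "_" onward.
# "SYMBOL" contains no "_", so it is left unchanged.
renamed <- toy
names(renamed) <- sub("_.*", "", names(renamed))

# Guard against two columns collapsing to the same shortened ID.
stopifnot(anyDuplicated(names(renamed)) == 0)
names(renamed)
```

On the real data, the same check would be expected to give 23349 x 559 rows after gather(), which is why head() alone only ever shows the first sample.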