Question

Failure in inner_join

1

6 months ago

egascon ▴ 60

I have two dataframes (df1 and df2) that look like this:

df1:

[image of df1 not available]

df2:

[image of df2 not available]

My aim is to change the IDs of df2 to geo_accession using df1 as reference:

Example: SYMBOL GSM4187205 ......

But I have encountered two problems:

1. df1 contains 587 samples, while df2 contains only 552; 35 samples have been removed. How can I subset df1 so that it contains exactly the same samples as df2?
2. I have written the following code:

df2.mod <- df2 %>%
  gather(key = "samples", value = "counts", -SYMBOL) %>%
  mutate(samples = gsub("X", "", samples)) %>%
  inner_join(., df1, by = c('samples' = 'title')) %>%
  spread(key = 'geo_accession', value = 'counts') %>%
  column_to_rownames(var = 'SYMBOL')

But it doesn't work because the number of samples and title variables don't match.

Error: <0 rows> (or 0-length row.names)

Thank you for your help,

dplyr r • 543 views

ADD COMMENT • link updated 6 months ago by zx8754 12k • written 6 months ago by egascon ▴ 60

0

My aim is to change the IDs of df2 to geo_accession using df1 as reference:

Example: SYMBOL GSM4187205 ......

Can you confirm that the values in the header of dataframe 2 correspond to column 2 of dataframe 1?

ADD REPLY • link 6 months ago by GenoMax 147k

0

Hi,

I confirm. I manually checked the first 10 against the full dataset. Both files come from the same data: one is the metadata dataset and the other is the gene expression dataset.

ADD REPLY • link 6 months ago by egascon ▴ 60

GenoMax · Answer 1 · 2024-06-06

0

6 months ago

zx8754 12k

gsub might be removing more than you need, because an unanchored pattern matches every "X" in the name, not just the leading one. See this example:

id <- c("X123", "X234_A", "X234_X")

gsub("X", "", id)
# [1] "123"   "234_A" "234_" 
gsub("^X", "", id)
# [1] "123"   "234_A" "234_X"

Either fix the gsub line or, when reading in the data, set check.names = FALSE:

df2 <- read.table("myfile.txt", header = TRUE, check.names = FALSE)
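To see what check.names changes, here is a minimal sketch using an inlined, made-up header instead of a real file (the sample names "123" and "234_X" and the gene rows are only illustrative):

```r
# Hypothetical file contents, inlined so the example is self-contained.
txt <- "SYMBOL\t123\t234_X\nTP53\t5\t7\nBRCA1\t2\t9"

# Default (check.names = TRUE): the header is passed through make.names(),
# so names that start with a digit get an "X" prefix.
with_check <- read.table(text = txt, header = TRUE, sep = "\t")
names(with_check)
# [1] "SYMBOL" "X123"   "X234_X"

# check.names = FALSE keeps the header exactly as written in the file.
no_check <- read.table(text = txt, header = TRUE, sep = "\t",
                       check.names = FALSE)
names(no_check)
# [1] "SYMBOL" "123"    "234_X"
```

With check.names = FALSE there is no prefix to strip, so the gsub step can be dropped entirely.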

ADD COMMENT • link 6 months ago by zx8754 12k

0

Hi,

Adding check.names = FALSE when reading the dataset works! The names now come out without the X, but the final result after the inner_join is:

[1] SYMBOL        samples       counts        geo_accession    
<0 rows> (or 0-length row.names)

I think the problem is the number of sample IDs: df2 has 552 and df1 has 587. How can I identify and remove the extra samples from df1?

ADD REPLY • link updated 6 months ago by GenoMax 147k • written 6 months ago by egascon ▴ 60
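A note on the counts: inner_join keeps only the keys found in both tables, so 587 versus 552 samples on its own would not give zero rows. A result with zero rows suggests that no key matched at all. Below is a minimal base R sketch for checking this and for subsetting df1, assuming df1 has the columns title and geo_accession used in the code in the question. The toy data and the trailing-space mismatch are invented for illustration, not taken from the real files.

```r
# Toy stand-ins for the real data (all values are made up).
df1 <- data.frame(
  title = c("123", "234_A", "999 "),        # note the trailing space in "999 "
  geo_accession = c("GSM1", "GSM2", "GSM3"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(SYMBOL = c("TP53", "BRCA1"),
                  `123` = c(5, 2), `234_A` = c(7, 9),
                  check.names = FALSE)

ids  <- names(df2)[-1]       # sample IDs taken from the df2 header
keys <- trimws(df1$title)    # strip stray whitespace before comparing

intersect(ids, keys)         # IDs present in both; empty means the join returns 0 rows
setdiff(keys, ids)           # metadata rows without an expression column
setdiff(ids, keys)           # expression columns without a metadata row

# Stop early if any df2 sample has no metadata row.
stopifnot(all(ids %in% keys))

# Keep only the metadata rows that have a column in df2, in df2's column order.
df1_sub <- df1[match(ids, keys), ]

# Rename the df2 sample columns to the matching geo_accession.
df2_mod <- df2
names(df2_mod)[-1] <- df1_sub$geo_accession
```

If intersect() comes back empty on the real data, printing head(ids) next to head(df1$title) usually shows the difference (prefixes, whitespace, or separators).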

0

You need to provide reproducible example datasets for us to help you further.
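One common way to share such an example is dput() on a small slice of each object. This is a sketch with a made-up df1; with the real data, the actual df1 and df2 would be used instead:

```r
# Toy stand-in; replace with the real df1 when preparing the example.
df1 <- data.frame(title = c("123", "234_A"),
                  geo_accession = c("GSM1", "GSM2"))

# dput() prints R code that recreates the object, including column types
# and any hidden whitespace in strings, so others can paste and run it.
dput(head(df1, 10))

# For a wide table, a few rows and columns are usually enough, e.g.:
# dput(df2[1:5, 1:5])
```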

ADD REPLY • link 6 months ago by zx8754 12k