Failure in inner_join
6 months ago
egascon ▴ 60

I have two dataframes, df1 and df2:

df1: [image not preserved]

df2: [image not preserved]

My aim is to change the IDs of df2 to geo_accession using df1 as reference:

Example: SYMBOL GSM4187205 ......

But I have encountered two problems:

  1. df1 lists 587 samples, while df2 has only 552; 35 samples have been removed. How can I subset df1 so that it contains exactly the same samples as df2?
  2. I have written the following code:
library(dplyr)   # %>%, mutate, inner_join
library(tidyr)   # gather, spread
library(tibble)  # column_to_rownames

df2.mod <- df2 %>%
  gather(key = "samples", value = "counts", -SYMBOL) %>%
  mutate(samples = gsub("X", "", samples)) %>%
  inner_join(., df1, by = c('samples' = 'title')) %>%
  spread(key = 'geo_accession', value = 'counts') %>%
  column_to_rownames(var = 'SYMBOL')

But it doesn't work. I think this is because the sample names in df2 and the title values in df1 don't match.

Error: <0 rows> (or 0-length row.names)

Thank you for your help,

dplyr r • 544 views

My aim is to change the IDs of df2 to geo_accession using df1 as reference:

Example: SYMBOL GSM4187205 ......

Can you confirm that values in header of dataframe 2 correspond to column 2 of dataframe 1?


Hi,

I confirm. I manually checked the first 10 against the full dataset. Both files come from the same data: one is the metadata dataset and the other is the gene expression dataset.

6 months ago
zx8754 12k

gsub might be removing more than you need, see example:

id <- c("X123", "X234_A", "X234_X")

gsub("X", "", id)
# [1] "123"   "234_A" "234_" 
gsub("^X", "", id)
# [1] "123"   "234_A" "234_X"

Either fix the gsub line or, when reading in the data, set check.names = FALSE:

df2 <- read.table("myfile.txt", header = TRUE, check.names = FALSE)
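
To illustrate the second option, here is a minimal sketch using an inline toy table instead of a file. The column names `123` and `234_X` are made up for illustration and are not the real sample IDs. It shows how check.names changes the header that read.table produces:

```r
# Toy input; the column names are hypothetical, not the real sample IDs.
txt <- "SYMBOL 123 234_X
GENE1 5 6
GENE2 7 8"

# Default (check.names = TRUE): names that start with a digit get an "X" prefix.
with_check <- read.table(text = txt, header = TRUE)
names(with_check)

# check.names = FALSE keeps the header exactly as written in the input.
no_check <- read.table(text = txt, header = TRUE, check.names = FALSE)
names(no_check)
```

With the header left untouched, no gsub step is needed, so an "X" inside a sample name cannot be stripped by accident.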

Hi,

Adding check.names = FALSE when reading the dataset works! The names come out without the X, but the final result after the inner_join is:

[1] SYMBOL        samples       counts        geo_accession    
<0 rows> (or 0-length row.names)

I think the problem is the number of sample IDs: df2 has 552 and df1 has 587. How can I identify and remove the excess samples from df1?


You need to provide reproducible example datasets for us to help you further.
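
As a starting point, a small reproducible example could look like the sketch below. The toy df1 and df2 are invented stand-ins: the titles, accessions, and counts are hypothetical, not taken from the real data. It uses base R to check which keys overlap before joining, since an inner join returns zero rows when no key matches exactly. It then subsets df1 and renames the df2 columns through a lookup.

```r
# Hypothetical stand-ins for the real data: df1 is metadata, df2 is counts.
df1 <- data.frame(
  title = c("123", "234_A", "234_X", "999"),
  geo_accession = c("GSM0000001", "GSM0000002", "GSM0000003", "GSM0000004"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  SYMBOL = c("GENE1", "GENE2"),
  `123` = c(5, 7), `234_A` = c(6, 8), `234_X` = c(1, 2),
  check.names = FALSE
)

sample_ids <- setdiff(names(df2), "SYMBOL")

# Which keys match, and which exist in only one table? If `shared` is empty,
# the keys differ in some way (prefix, whitespace, case) and a join gives 0 rows.
shared      <- intersect(sample_ids, df1$title)
only_in_df1 <- setdiff(df1$title, sample_ids)
only_in_df2 <- setdiff(sample_ids, df1$title)

# Keep only the metadata rows whose title is a column of df2.
df1_sub <- df1[df1$title %in% sample_ids, ]

# Rename the shared df2 columns to their geo_accession via a lookup.
df2_mod <- df2[, c("SYMBOL", shared)]
names(df2_mod)[-1] <- df1$geo_accession[match(shared, df1$title)]
```

On the real data, printing head(only_in_df2) and head(only_in_df1) side by side would probably show how the two sets of names differ.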
