Question

Matching gene ids in R

4.0 years ago

aj123 ▴ 120

I have two .csv files. I need to match the IDs in the first row of one file with the IDs in the first column of another file. I'm trying this:

If the first file contains a column named "ids", I extract the sample IDs like this:

first_file_samples <- first_data_frame$ids

And for the second file, where all of the column names are IDs:

second_file_samples <- colnames(second_data_frame)

I use the colnames() call above to extract the sample IDs for the second file. Then I take the intersection of the two vectors:

intersect_sample_ids <- intersect(first_file_samples, second_file_samples)

Then, to subset both files to the shared IDs:

subset_first_file <- first_data_frame %>% filter(ids %in% intersect_sample_ids)

subset_second_file <- second_data_frame %>% select(all_of(intersect_sample_ids))

But it does not seem to be working. Please tell me what could be going wrong?

R rna-seq • 883 views

updated 4.0 years ago by bkleiboeker ▴ 370 • written 4.0 years ago by aj123 ▴ 120

Hi, can you include the first few lines from each of the files, and an example of the desired output? The easiest way to share the data would be the output of dput(head(first_file_samples)) as an example for the first file.

4.0 years ago by rpolicastro 13k

score 0 · Answer 1 · 2021-01-11

Here's a workaround I sometimes use to copy a column from one df to another by aligning rows on a shared key, like gene IDs:

Say column 2 in df1 contains logCPM values; then we can 'collect' those values in a new column in df2 (call it df2$logCPM), matched by gene ID, using

df2$logCPM <- df1[[2]][match(df2$ID, df1$ID)]
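To make the match() step concrete, here is a minimal sketch on toy data. The gene names and logCPM values are made up for illustration, and it assumes both data frames have a column called ID, as in the line above.

```r
# Toy data (hypothetical values, for illustration only)
df1 <- data.frame(ID = c("geneA", "geneB", "geneC"),
                  logCPM = c(5.1, 2.3, 7.8),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("geneC", "geneA", "geneD"),
                  stringsAsFactors = FALSE)

# For each ID in df2, match() returns the row position of that ID in df1,
# or NA when the ID is not present in df1 at all
idx <- match(df2$ID, df1$ID)

# Pull column 2 of df1 (logCPM) in df2's row order
df2$logCPM <- df1[[2]][idx]

print(idx)
print(df2)
```

IDs that are missing from df1 (geneD here) come back as NA rather than raising an error, so it is worth checking for NAs after the match.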

I'm curious whether there's a better way to do what you're describing, but I would use the code above one column at a time to combine the information from the two data frames into a single data frame with all the relevant info. The worst part of my solution is its reliance on a magic number (the column number of the desired information in df1), so I'm hoping to learn a more dynamic solution to this problem as well!
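One possible way around the magic number, sketched here on made-up data, is to refer to the column by name, or to let base R's merge() join the two data frames on the shared key. The column names (ID, logCPM, logFC, group) and all values are assumptions for illustration, not taken from the original files.

```r
# Toy data (hypothetical values, for illustration only)
df1 <- data.frame(ID = c("geneA", "geneB", "geneC"),
                  logCPM = c(5.1, 2.3, 7.8),
                  logFC = c(1.2, -0.4, 0.9),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("geneC", "geneA", "geneD"),
                  group = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

# Option 1: same match() idea, but the column is selected by name
by_name <- df2
by_name$logCPM <- df1[["logCPM"]][match(by_name$ID, df1$ID)]

# Option 2: merge() brings over every column of df1 at once.
# all.x = TRUE keeps df2 rows that have no partner in df1 (filled with NA).
# Note that merge() sorts the result by the key column by default.
merged <- merge(df2, df1, by = "ID", all.x = TRUE)

print(by_name)
print(merged)
```

Option 1 keeps df2's original row order; Option 2 avoids repeating the match one column at a time, at the cost of possibly reordering the rows.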