Why am I getting only one common value when comparing two excel files?
2
0
Entering edit mode
23 months ago
namck • 0

I have two excel files which have quite a a lot common gene names. What I would like to do is finding the common gene names and adding the corresponding columns with their values to a new csv file.

The below code gives an output containing a single row, which should not be the case as I have sufficient common gene names. Therefore, can someone please tell me what I'm doing wrong here?

Thanks!

First excel file, First excel file

Second excel file, second excel file

Ouput what I got, (the formatting is correct, but why am I getting only one?) My desired output should look like this

Snippet from the code, snippet from the code I've tried

python pandas excel • 1.1k views
ADD COMMENT
0
Entering edit mode

Are you certain there are absolute matches between the two? The example shows df2 with many gene names capitalized. These won't match if they are not also capitalized in df1. Can you sort each and show a top set with names in common? Could there be a trailing space, or other character in one of the gene name columns? (unlikely, but I've seen this happen).

ADD REPLY
0
Entering edit mode

Thanks! I have made both the files lowercase and run the same file. This time it returned a list of common genes, but subsequent column are missing now. I am getting output like this, new_output

ADD REPLY
0
Entering edit mode

According to your code, you asked for a set of common values, and that's what it returned. Are you trying to merge the two data frames with these values? (I gave an answer for this below). Or are you trying to subset one data frame by the values in common? What code are you running for the image above? What were you trying to achieve?

ADD REPLY
0
Entering edit mode
23 months ago
bkleiboeker ▴ 370

It looks like the first dataframe has gene names with the first letter capitalized, while the second dataframe has the gene names as all lowercase.

In any case, I'd personally suggest trying to go upstream and use Ensembl or Entrez IDs instead of gene symbols if possible.

ADD COMMENT
0
Entering edit mode
23 months ago
seidel 11k

You mentioned fixing the case of the values shared by each table, so that they will match. If you'd like to join the two tables together, you could do so using the merge command. For example:

import pandas as pd

# create two data frames
df1 = pd.DataFrame({'gene': ['act1', 'gapdh', 'foomoo'], 'expression': [3,1,-12], 'chr': ['chr1', 'chr12', 'chr39']})
df2 = pd.DataFrame({'ID': ['oct4', 'gapdh', 'act1'], 'tiktok_rank': [23,145,3], 'fav_cell_type': ['skin', 'lips', 'banana']})

# see what they have in common
set(df1['gene']).intersection(set(df2['ID']))

{'act1', 'gapdh'}

# merge them to one
pd.merge(df1, df2, left_on='gene', right_on='ID')

merged

ADD COMMENT

Login before adding your answer.

Traffic: 2107 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6