Question

Parsing df with pandas and some conditions

0

Entering edit mode

7.5 years ago

benjamin ▴ 10

Hi all, I need some help with python and pandas.

I actually have a dataframe with in the column seq1_id al the seq_id of sequences of the species 1 and the column 2 for the sequences of the sp2.

I actually passed a filter on those sequences and got two dataframes (one with all sequences of sp 1 passed through the filter) and (one with all sequences of sp2 passed through the filter).

Then I have 3 dataframes.

Because in a pairs, one seq can passe the filter while the other does not, it is important to keep only paired genes wich are keeping on the two previous filtering, so what I need to do is acutally to parse my first df such this one:

Seq1.id      Seq2.id
seq1_01     seq5_02
seq2_01     Seq6_02
seq3_01     Seq7_02
seq4_01     Seq8_02

and check row by row if (ex the first row) seq1_01 is present in the df2 and if seq8_02 is also present in the df3, then keep this row in the df1 and add it in a new df4.

Here is an exemple with output wanted:

first df: 

 Seq1.id     Seq2.id
seq1_01     seq5_02
seq2_01     Seq6_02
seq3_01     Seq7_02
seq4_01     Seq8_02

df2 (sp1) (seq3_01 is absent)

    Seq_1.id   
    seq1_01     
    seq2_01        
    seq4_01 


df3 (sp2) (Seq8_02 is absent)

   Seq_2.id
   seq5_02
   Seq6_02
   Seq7_02

Then because Seq8_02 and seq3_01 are not present, the df4 (output) would be:

    Seq1.id   Seq2.id
    seq1_01     seq5_02
    seq2_01    Seq6_02

Hi tried:

HGT_candidats_0035=candidates_0035
HGT_candidats_0042=candidates_0042

#convert gene names into a list
gene_name_0035=[]
for i in HGT_candidats_0035["gene"]:
    gene_name_0035.append(i)

gene_name_0042=[]
for i in HGT_candidats_0042["gene"]:
    gene_name_0042.append(i)

#Keep only paired sequences
seq1_id=[]
for i in dN_dS["seq1_id"]:
    seq1_id.append(i)

seq2_id=[]
for i in dN_dS["seq2_id"]:
    seq2_id.append(i)

newdf = pd.DataFrame(columns=("seq1_id","seq2_id"))

for a, b in zip(seq1_id,seq2_id):
    if a in gene_name_0035 and b in gene_name_0042:
        newdf=newdf.append({"seq1_id":a,"seq2_id":b}, ignore_index=True)

But I think it is too long

Here is you code: with my data

candidates_0035=pd.read_csv("candidates_genes_filtering_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_filtering_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')

df4 = pd.DataFrame(columns=dN_dS.columns)
print(df4)
for index, row in dN_dS.iterrows():
    if row['seq1_id'] in candidates_0042['gene'] and row['seq2_id'] in candidates_0035['gene']:
        df4 = df4.append(row, ignore_index=True)

df4.to_csv("new_df",sep='\t')

and here the empty output of df4:

Unnamed: 0  Unnamed: 0.1    seq1_id seq2_id dN  dS  Dist_third_pos  Dist_brute  Length_seq_1    Length_seq_2    GC_content_seq1 GC_content_seq2 GC  Mean_length

Here are the data: https://drive.google.com/file/d/1FR9MUk4x0NoM-r3F4oe6dt5HgDMaUlKv/view?usp=sharing https://drive.google.com/file/d/1MWRJwqRAA2B7eAXG1hcnIAqeQyjtx7pT/view?usp=sharing https://drive.google.com/file/d/10ZP-Awx_qevKoT-AfMjDpd8KKaUcsEog/view?usp=sharing

pandas python • 4.5k views

ADD COMMENT • link updated 4.6 years ago by stevejonn • 0 • written 7.5 years ago by benjamin ▴ 10

0

Entering edit mode

This is a short story about loop over the first dataframe then look in the 2 others.

Try something first and i'll help :)

ADD REPLY • link 7.5 years ago by Bastien Hervé 6.5k

0

Entering edit mode

I posted what I tried but I think it is too long and maybe I do some mistake as well

ADD REPLY • link 7.5 years ago by benjamin ▴ 10

0

Entering edit mode

This is how you create your first df not how you try to resolve your issue.

You need to loop over your newdf now

Care at the french typos you made in your post (passe, filtre...). That could bother more than one person here.

ADD REPLY • link 7.5 years ago by Bastien Hervé 6.5k

0

Entering edit mode

Well if I do the for a, b in zip(seq1_id,seq2_id): part I ask for each row if we find this same seq_id in the df2 and df3, then add to a new df the row in the first df no? Yep sorry I corrected it

ADD REPLY • link 7.5 years ago by benjamin ▴ 10

1

Entering edit mode

Ok, I just got that dN_dS is your first df, candidates_0035 is your df2 and candidates_0042 is your df3

I didn't try but something like this (or very close) should work :

df4 = pd.DataFrame(columns=dN_dS.columns)
for index, row in dN_dS.iterrows():
    if row['Seq1.id'] in candidates_0035['Seq_1.id'] and row['Seq2.id'] in candidates_0042['Seq_2.id']:
        df4 = df4.append(row, ignore_index=True)

ADD REPLY • link 7.5 years ago by Bastien Hervé 6.5k

0

Entering edit mode

Thank you but I updated my first post with the code I used, The df4 outpfile is actually empty (only columns names). I also put my data if you want to see them.

ADD REPLY • link 7.5 years ago by benjamin ▴ 10

0

Entering edit mode

Finnally I succed by passing the df2 and 3 into a list.

ADD REPLY • link 7.5 years ago by benjamin ▴ 10

0

Entering edit mode

Here you go (with your files) :

candidates_0035=pd.read_csv("candidates_genes_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')

df4 = pd.DataFrame(columns=dN_dS.columns)

for index, row in dN_dS.iterrows():
    if row['seq1_id'] in candidates_0035['gene'].tolist() and row['seq2_id'] in candidates_0042['gene'].tolist():
    df4 = df4.append(row, ignore_index=True)

df4.to_csv("new_df",sep='\t')

ADD REPLY • link 7.5 years ago by Bastien Hervé 6.5k

0

Entering edit mode

The simplest solution would be to read the "Unnamed: 0" column as the index. So, what you have to do is to specify an index_col=[0] argument to read_csv() function, then it reads in the first column as the index.

pd.read_csv('file.csv', index_col=[0])

In some other cases, it is caused by your to_csv() having been saved along with an "Unnamed: 0" index. You could have avoided this mistakes in the first place by using "index=False" if the output CSV was created in DataFrame.

df.to_csv('file.csv', index=False)

ADD REPLY • link 4.6 years ago by stevejonn • 0

score 0 · Answer 1 · 2018-05-18

import pandas as pd
from pandasql import sqldf as psql

print(df1)
print(df2)
print(df3)

    Seq1.id  Seq2.id
0  seq1.01  seq5.02
1  seq2.01  Seq6.02
2  seq3.01  Seq7.02
3  seq4.01  Seq8.02
  Seq.1.id
0  seq1.01
1  seq2.01
2  seq4.01
  Seq.2.id
0  seq5.02
1  Seq6.02
2  Seq7.02

psql('select [Seq1.id],[Seq2.id] from df1 inner join df2 on df1.[Seq1.id]=df2.[Seq.1.id] inner join df3 on df1.[Seq2.id]=df3.[Seq.2.id]')

     Seq1.id    Seq2.id
0   seq1.01     seq5.02
1   seq2.01     Seq6.02

Following output needs a little bit of home work (with same output):

df1.merge(df2, left_on='Seq1.id', right_on='Seq.1.id', sort=True, how='inner').merge(df3,left_on='Seq2.id', right_on='Seq.2.id',sort=True, how="inner")

    Seq1.id     Seq2.id     Seq.1.id    Seq.2.id
0   seq2.01     Seq6.02     seq2.01     Seq6.02
1   seq1.01     seq5.02     seq1.01     seq5.02