Hi all, I need some help with Python and pandas.
I have a dataframe where the column seq1_id holds the seq IDs of the sequences of species 1 and the second column holds those of species 2.
I passed a filter over those sequences and got two more dataframes: one with all the species 1 sequences that passed the filter, and one with all the species 2 sequences that passed the filter.
So I have 3 dataframes in total.
Because within a pair one sequence can pass the filter while the other does not, I need to keep only the pairs whose two members both survived the filtering. So what I need to do is parse my first df, such as this one:
Seq1.id Seq2.id
seq1_01 seq5_02
seq2_01 Seq6_02
seq3_01 Seq7_02
seq4_01 Seq8_02
and check row by row whether (e.g. for the first row) seq1_01 is present in df2 and seq5_02 is present in df3; if so, keep this row of df1 and add it to a new df4.
Here is an example with the wanted output:
first df:
Seq1.id Seq2.id
seq1_01 seq5_02
seq2_01 Seq6_02
seq3_01 Seq7_02
seq4_01 Seq8_02
df2 (sp1) (seq3_01 is absent)
Seq_1.id
seq1_01
seq2_01
seq4_01
df3 (sp2) (Seq8_02 is absent)
Seq_2.id
seq5_02
Seq6_02
Seq7_02
Then, because Seq8_02 and seq3_01 are not present, df4 (the output) would be:
Seq1.id Seq2.id
seq1_01 seq5_02
seq2_01 Seq6_02
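To make the intent concrete, here is a minimal sketch of that row-by-row check on the toy data above (df1, df2, df3 and df4 are just the names used in this example, not my real code):

import pandas as pd

# Toy data from the example: df1 holds the pairs, df2/df3 hold the
# sp1/sp2 sequences that passed the filter.
df1 = pd.DataFrame({"Seq1.id": ["seq1_01", "seq2_01", "seq3_01", "seq4_01"],
                    "Seq2.id": ["seq5_02", "Seq6_02", "Seq7_02", "Seq8_02"]})
df2 = pd.DataFrame({"Seq_1.id": ["seq1_01", "seq2_01", "seq4_01"]})
df3 = pd.DataFrame({"Seq_2.id": ["seq5_02", "Seq6_02", "Seq7_02"]})

# Keep a pair only if both members passed their respective filter.
kept_sp1 = set(df2["Seq_1.id"])
kept_sp2 = set(df3["Seq_2.id"])
rows = [row for _, row in df1.iterrows()
        if row["Seq1.id"] in kept_sp1 and row["Seq2.id"] in kept_sp2]
df4 = pd.DataFrame(rows).reset_index(drop=True)
print(df4)  # only the seq1_01/seq5_02 and seq2_01/Seq6_02 rows remain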
Here is what I tried:
HGT_candidats_0035 = candidates_0035
HGT_candidats_0042 = candidates_0042

# convert gene names into a list
gene_name_0035 = []
for i in HGT_candidats_0035["gene"]:
    gene_name_0035.append(i)

gene_name_0042 = []
for i in HGT_candidats_0042["gene"]:
    gene_name_0042.append(i)

# keep only paired sequences
seq1_id = []
for i in dN_dS["seq1_id"]:
    seq1_id.append(i)

seq2_id = []
for i in dN_dS["seq2_id"]:
    seq2_id.append(i)

newdf = pd.DataFrame(columns=("seq1_id", "seq2_id"))
for a, b in zip(seq1_id, seq2_id):
    if a in gene_name_0035 and b in gene_name_0042:
        newdf = newdf.append({"seq1_id": a, "seq2_id": b}, ignore_index=True)
But I think it is too long.
Here is your code with my data:
import pandas as pd

candidates_0035 = pd.read_csv("candidates_genes_filtering_0035", sep='\t')
candidates_0042 = pd.read_csv("candidates_genes_filtering_0042", sep='\t')
dN_dS = pd.read_csv("dn_ds.out_sorted", sep='\t')

df4 = pd.DataFrame(columns=dN_dS.columns)
print(df4)

for index, row in dN_dS.iterrows():
    if row['seq1_id'] in candidates_0042['gene'] and row['seq2_id'] in candidates_0035['gene']:
        df4 = df4.append(row, ignore_index=True)

df4.to_csv("new_df", sep='\t')
and here is the empty output of df4:
Unnamed: 0 Unnamed: 0.1 seq1_id seq2_id dN dS Dist_third_pos Dist_brute Length_seq_1 Length_seq_2 GC_content_seq1 GC_content_seq2 GC Mean_length
Here are the data: https://drive.google.com/file/d/1FR9MUk4x0NoM-r3F4oe6dt5HgDMaUlKv/view?usp=sharing https://drive.google.com/file/d/1MWRJwqRAA2B7eAXG1hcnIAqeQyjtx7pT/view?usp=sharing https://drive.google.com/file/d/10ZP-Awx_qevKoT-AfMjDpd8KKaUcsEog/view?usp=sharing
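One possible reason df4 comes out empty: for a pandas Series, the expression x in series tests the index labels, not the values, so both conditions in the iterrows loop are very likely always False. Also note that, compared to my first attempt, seq1_id is checked against the 0042 file here and seq2_id against the 0035 file. A sketch of the same filtering using isin() on the values (keeping the pairing from my first attempt; adjust if the files go the other way around):

# Test membership against the column values, not the Series index.
keep = (dN_dS["seq1_id"].isin(candidates_0035["gene"])
        & dN_dS["seq2_id"].isin(candidates_0042["gene"]))
df4 = dN_dS[keep]
df4.to_csv("new_df", sep='\t', index=False)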
This is simply a matter of looping over the first dataframe and then looking in the 2 others.
Try something first and I'll help :)
I posted what I tried, but I think it is too long, and maybe I made some mistakes as well.
This is how you create your first df, not how you try to resolve your issue. You need to loop over your newdf now.
Watch out for the French typos you made in your post (passe, filtre...). That could bother more than one person here.
Well, with the for a, b in zip(seq1_id, seq2_id): part, I ask for each row whether we find the same seq_id in df2 and df3, and then add the row from the first df to a new df, no? Yep, sorry, I corrected it.
Ok, I just got that dN_dS is your first df, candidates_0035 is your df2 and candidates_0042 is your df3. I didn't try it, but something like this (or very close) should work:
Thank you, but I updated my first post with the code I used; the df4 output file is actually empty (only column names). I also put up my data if you want to look at them.
Finally I succeeded by passing df2 and df3 into lists.
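For reference, a minimal sketch of that fix, assuming the same column names as above (membership tests on a plain Python list look at the values, unlike in on a Series):

genes_0035 = candidates_0035["gene"].tolist()
genes_0042 = candidates_0042["gene"].tolist()

# Keep only the rows of dN_dS whose two sequences passed their filter.
rows = [row for _, row in dN_dS.iterrows()
        if row["seq1_id"] in genes_0035 and row["seq2_id"] in genes_0042]
df4 = pd.DataFrame(rows, columns=dN_dS.columns)
df4.to_csv("new_df", sep='\t', index=False)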
Here you go (with your files):
The simplest solution would be to read the "Unnamed: 0" column as the index: specify an index_col=[0] argument to the read_csv() function, and it will read in the first column as the index.
In other cases, it is caused by to_csv() having saved the DataFrame along with an "Unnamed: 0" index. You could have avoided this mistake in the first place by passing index=False when writing the output CSV from the DataFrame.
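For example (the file name is just the one from the code above):

# Read the first column back as the index instead of getting an "Unnamed: 0" column.
df = pd.read_csv("new_df", sep='\t', index_col=[0])

# Or avoid writing the index in the first place when saving.
df4.to_csv("new_df", sep='\t', index=False)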