Hello,
I have 2 dataframes.
Dataframe 1 is a list of PDB chains containing a particular substructure, and looks like: df1 = PDBCH RESIDUE_1 RESIDUE_N SUBSTRUCTURE_SEQUENCE
Dataframe 2 is the pdb2pfam mapping file from here http://ftp.ebi.ac.uk/pub/databases/Pfam/mappings/, and looks like: df2 = PDBCH PDB_START PDB_END PFAM_ACCESSION PFAM_NAME
where PDBCH means PDB code + chain ID, so each entry is 5 characters long.
To map the PDBCH entries in df1 to df2, and thus get the Pfam families for each of my results in df1, I do this:
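import pandas as pd

df1_to_pfam_list = []
for value in df1.PDBCH:
    # index labels of the df2 rows whose PDBCH matches this df1 entry
    pfam_indexes_list = df2.index[df2['PDBCH'] == value].tolist()
    # select by label (.loc), since the list holds index labels, not positions
    df3 = df2.loc[pfam_indexes_list, :]
    df1_to_pfam_list.append(df3)
df1_to_pfam_df = pd.concat(df1_to_pfam_list)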
Thus, df1_to_pfam_df looks like df2, but follows the order of df1 and keeps the index values of df2. That is:
Index_df2 PDBCH(df1 order) PDB_START PDB_END PFAM_ACCESSION PFAM_NAME
Now I need to merge this new dataframe (df1_to_pfam_df) with df1 so that I can check whether the residue range RESIDUE_1 to RESIDUE_N in df1 falls inside the Pfam domains (PDB_START to PDB_END in df2). The problem is that df1_to_pfam_df differs in size from df1, because some PDBCH entries map to more than one Pfam family.
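To make the goal concrete, here is a rough sketch of the kind of result I am after, on made-up toy data. It assumes a left merge on PDBCH is acceptable even though it repeats df1 rows for chains with several Pfam families, and that the residue numbers are plain integers (no insertion codes). The values and the IN_PFAM and IN_ANY_PFAM column names are hypothetical, and I am not sure this is the right approach:

```python
import pandas as pd

# Toy data with made-up values, using the column layout described above.
df1 = pd.DataFrame({
    "PDBCH": ["1ABCA", "2XYZB", "3DEFC"],
    "RESIDUE_1": [10, 50, 5],
    "RESIDUE_N": [20, 60, 15],
    "SUBSTRUCTURE_SEQUENCE": ["ACDEFGHIKLM", "NPQRSTVWYAC", "DEFGHIKLMNP"],
})
df2 = pd.DataFrame({
    "PDBCH": ["1ABCA", "1ABCA", "2XYZB"],
    "PDB_START": [1, 100, 55],
    "PDB_END": [90, 200, 120],
    "PFAM_ACCESSION": ["PF_TOY_1", "PF_TOY_2", "PF_TOY_3"],
    "PFAM_NAME": ["fam_a", "fam_b", "fam_c"],
})

# Left merge: one row per (df1 row, Pfam domain) pair. df1 rows with no
# Pfam mapping are kept, with NaN in the df2 columns. reset_index() stores
# the original df1 row label in a column called "index".
merged = df1.reset_index().merge(df2, on="PDBCH", how="left")

# True only when the whole residue range lies inside the domain.
# Comparisons against NaN evaluate to False, so unmapped chains give False.
merged["IN_PFAM"] = (
    (merged["RESIDUE_1"] >= merged["PDB_START"])
    & (merged["RESIDUE_N"] <= merged["PDB_END"])
)

# Collapse back to one value per df1 row: inside at least one domain?
df1["IN_ANY_PFAM"] = merged.groupby("index")["IN_PFAM"].any()

print(merged[["PDBCH", "PFAM_ACCESSION", "IN_PFAM"]])
print(df1[["PDBCH", "IN_ANY_PFAM"]])
```

Here merged keeps one row per chain/domain pair, while IN_ANY_PFAM brings the answer back to the size of df1.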
I'm quite stuck at this point. Any suggestions?
Thank you,
Juan