Hi everyone, I'm trying to use a substitution matrix to encode peptide sequences with pandas for a PyTorch ML model. The substitution matrix has 20 rows and 20 columns, and I want to replace each letter of a peptide in peps, e.g. "A", with the 20 values in column A of sub_matrix. I tried .to_string(index=False), but it returns extra characters (spaces and newlines), and the values are actually floats, not strings, so it's definitely not ideal.
What can I use instead to get only the values, without spaces and newlines?
Also, I would be very glad if anyone could suggest the best way to process this data in PyTorch. I have previously used some ML packages in R, where all the values would be in one data frame. Is it better to put everything in one list and then convert it to a tensor, or to keep a separate list for each peptide?
my pandas data frame:
sub_matrix = pd.read_csv('blosum62_pd_ori.txt', header = 0, nrows = 20)
sub_matrix
         A       R  ...       V
0   0.2901  0.0310  ...  0.0688
1   0.0446  0.3450  ...  0.0310
2   0.0427  0.0449  ...  0.0270
..     ...     ...  ...     ...
17  0.0303  0.0227  ...  0.0303
18  0.0405  0.0280  ...  0.0467
19  0.0700  0.0219  ...  0.2689
peps = ['GARRNDACE', 'QEERGGDPA']
the code:
def encode(pep):
    AAs = list(pep)
    encoded = []
    for aa in AAs:
        if aa in sub_matrix.columns:
            freqs = sub_matrix[aa].to_string(index=False)
            encoded.append(freqs)
    return encoded

for pep in peps:
    print(encode(pep))
I would like the output to be one non-nested list of all values, like:
['0.0783', '0.0329', '0.0652', '0.0466', '0.0325', '0.0412', '0.0350', ..., '0.5101', '0.0382', '0.0206', '0.0213', '0.0432', '0.0281', '0.0254', '0.0233']
but now it is:
[' 0.0783\n 0.0329\n 0.0652\n 0.0161\n 0.0106\n 0.0103\n 0.0175\n 0.0178\n ... ,' 0.0405\n 0.0523\n 0.0494\n 0.0914\n 0.0163\n 0.1029\n 0.2965']
[' 0.0256\n 0.0484\n 0.0337\n 0.0299\n 0.0122\n 0.2147\n 0.0645\n 0.0189\n ... ,' 0.0338\n 0.0568\n 0.1099\n 0.0730\n 0.0303\n 0.0405\n 0.0700']
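For anyone landing here with the same problem, one possible fix is to take the column values as plain Python floats with Series.tolist() and add them with list.extend() instead of append(), so the result stays flat. This is only a minimal sketch: the 3x3 sub_matrix below is a made-up toy stand-in for the real 20x20 matrix (the values are placeholders, not BLOSUM62 frequencies), and "GAR" is just an illustrative peptide.

```python
import pandas as pd

# Toy stand-in for the real 20x20 matrix; values are placeholders
# chosen for illustration only, not real substitution frequencies.
sub_matrix = pd.DataFrame({
    "A": [0.5, 0.25, 0.25],
    "G": [0.125, 0.75, 0.125],
    "R": [0.25, 0.25, 0.5],
})

def encode(pep):
    """Return one flat list of floats: each residue's column, concatenated."""
    encoded = []
    for aa in pep:
        if aa in sub_matrix.columns:
            # .tolist() gives plain Python floats (no spaces or newlines);
            # extend() adds them one by one, so the list is not nested.
            encoded.extend(sub_matrix[aa].tolist())
    return encoded

flat = encode("GAR")
print(flat)
```

As in the original function, letters that are not column names are skipped silently, so peptides containing unknown residues produce shorter lists.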
Thank you. I have actually realised this simple thing myself! Feels a bit silly, thanks.
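On the PyTorch part of the question, one option (a sketch, not the only reasonable layout) is to keep one 2-D array per peptide, with one row per residue, and stack equal-length peptides into a single 3-D array. The example again uses a made-up 3x3 toy matrix and hypothetical peptides; encode_2d is an illustrative helper name, and it assumes every letter is a column of the matrix and all peptides have the same length.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real 20x20 matrix; placeholder values only.
sub_matrix = pd.DataFrame({
    "A": [0.5, 0.25, 0.25],
    "G": [0.125, 0.75, 0.125],
    "R": [0.25, 0.25, 0.5],
})

def encode_2d(pep):
    """Return an array of shape (len(pep), n_rows): one row per residue.

    Assumes every letter in pep is a column of sub_matrix
    (an unknown letter raises KeyError).
    """
    return sub_matrix[list(pep)].to_numpy().T

peps = ["GAR", "RAG"]  # hypothetical equal-length peptides

# Stack into (n_peptides, peptide_length, n_rows); float32 is the
# default floating-point dtype of PyTorch layers.
batch = np.stack([encode_2d(p) for p in peps]).astype(np.float32)
print(batch.shape)
```

An array like this can be handed to torch.from_numpy(batch) to get a tensor. If the model expects one flat vector per peptide, batch.reshape(len(peps), -1) gives the same values in the flat layout asked about above.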