Hi all :)
I have a question about the distance matrices produced by Clustal Omega.
As far as I know, they give the similarity between each pair of sequences, either as distances or as percentages, like this:
100.000000 21.944035 22.133939 23.723042 19.750284 20.431328 20.885358 21.679909
21.944035 100.000000 22.827688 21.796760 22.974963 20.324006 21.944035 24.889543
22.133939 22.827688 100.000000 21.152030 22.474032 17.387033 19.830028 20.963173
23.723042 21.796760 21.152030 100.000000 20.437018 24.361493 19.059107 19.436957
19.750284 22.974963 22.474032 20.437018 100.000000 21.414538 20.094259 21.765210
20.431328 20.324006 17.387033 24.361493 21.414538 100.000000 20.432220 20.432220
20.885358 21.944035 19.830028 19.059107 20.094259 20.432220 100.000000 19.018898
21.679909 24.889543 20.963173 19.436957 21.765210 20.432220 19.018898 100.000000
But what if I want the percentage difference between each pair of sequences, based on those matrices?
I'm working on a pipeline that needs to retrieve pairs with similarity >= 90.00 for the left flanking region and pairs with difference >= 50.00 for the right flanking region. Here's the code snippet I wrote for that:
import numpy as np

files = ['Arr-Right(Aestivum_Japonica).dst', 'Arr-Left(Aestivum_Japonica).dst']
for i in range(len(files)):
    # Keep the part between the first "-" and the first "." of the file name
    name = files[i][files[i].find("-")+1:files[i].find(".")]
    retrieved = open("Rtrv-"+name+".csv", 'w', newline='')
    retrieved.write('{0:^14}\t{1:^8}\t{2:^10}\n'.format("Similarity (%)", "Query ID", "Subject ID"))
    data = np.genfromtxt(files[i])
    for row_idx, row in enumerate(data):
        for col_idx, element in enumerate(row):
            # Skip the diagonal and the lower triangle (each pair is visited once)
            if row_idx >= col_idx:
                continue
            # Left flank: similarity of at least 90 %
            elif "Left" in name and element >= 90.000000:
                retrieved.write('{0:10.6f}\t{1:d}\t{2:d}\n'.format(element, row_idx, col_idx))
            # Right flank: difference (100 - similarity) of at least 50 %
            elif "Right" in name and (100-element) >= 50.000000:
                retrieved.write('{0:10.6f}\t{1:d}\t{2:d}\n'.format(element, row_idx, col_idx))
    retrieved.close()
My question is about the correctness of the condition I used: is it simply (100-element)>=50.000000, or am I missing something?
Thanks in advance
Edited: added the list of file names to the code snippet.
Could someone help me with this, please?
I really need to get the right answer. Thank you all.
Looks good to me, though I don't understand the first 4 lines of code. Maybe explain the code a little bit?
@RamRS... The first 4 lines iterate over a list of matrix file names, strip a prefix I added earlier to distinguish them from other files, add a new prefix for the result file name, open that file for writing, and write a header before the filtering part starts.
I wrote it that way to avoid overwriting files and to keep the final file names free of the old prefixes and suffixes, that's all :)
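To make the file-name handling concrete, here is a small sketch of what the slicing in the snippet does to one of the names from the question. The helper name output_name is made up for illustration; the original code does the same thing inline.

```python
def output_name(dst_file):
    """Hypothetical helper: mirror the inline slicing from the snippet."""
    # Keep everything after the first "-" and before the first "."
    name = dst_file[dst_file.find("-") + 1:dst_file.find(".")]
    # Add the new prefix and extension used for the result file
    return "Rtrv-" + name + ".csv"


result = output_name('Arr-Right(Aestivum_Japonica).dst')
print(result)
```

Note that find(".") returns the first dot, so this only behaves as intended while the file names contain a single dot.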
Oh, I see. Does the code work?
@RamRS... Yes, it works perfectly :D
I'm afraid there is a conceptual error in that condition. Can you please confirm its correctness for me?
That is what I was wondering as well, but I guess 100-similarity is a crude measure of dissimilarity. How else would you find a quantifying parameter for difference from similarity matrices?
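If 100-similarity is taken as the (crude) difference, then the right-flank test (100-element)>=50 is just another way of writing element<=50. Here is a minimal sketch of that, using the first three rows and columns of the matrix from the question; the function names percent_difference and filter_pairs are made up for illustration and are not part of the original pipeline.

```python
# Percent-similarity values copied from the top-left 3x3 corner of the matrix in the question
similarity = [
    [100.000000, 21.944035, 22.133939],
    [21.944035, 100.000000, 22.827688],
    [22.133939, 22.827688, 100.000000],
]


def percent_difference(value):
    """Crude dissimilarity (assumption): everything that is not similar, in percent."""
    return 100.0 - value


def filter_pairs(matrix, keep):
    """Return (row, col, similarity) for upper-triangle pairs where keep(similarity) is true."""
    hits = []
    for row_idx, row in enumerate(matrix):
        # Start after the diagonal so each pair is visited once
        for col_idx in range(row_idx + 1, len(row)):
            if keep(row[col_idx]):
                hits.append((row_idx, col_idx, row[col_idx]))
    return hits


# Right flank: difference >= 50 %, written both ways
by_difference = filter_pairs(similarity, lambda s: percent_difference(s) >= 50.0)
by_similarity = filter_pairs(similarity, lambda s: s <= 50.0)

# Left flank: similarity >= 90 %
left_hits = filter_pairs(similarity, lambda s: s >= 90.0)

print(by_difference == by_similarity)
print(len(by_difference), len(left_hits))
```

With these values every off-diagonal pair is well under 50 % similar, so all three pairs pass the right-flank filter and none pass the left-flank one.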