2.1 years ago
Paula
Hi! I am trying to write code that takes the information in a text file and converts it to a nested dictionary. The outer dictionary is keyed by cluster name (Cluster 0, Cluster 1). Inside each cluster, the dictionary named "samples" is keyed by sample name (SOL_1_3, SOL_1_50, SOL_1_40), and each sample has a calculated cov value. For example, in Cluster 0 the cov value for sample SOL_1_50 is 7, which is the sum of the cov values of that sample's two entries (cov_3.5 each).
>Cluster 0
0 948aa, >SOL_1_50_cov_3.5_N_171282... at 100.00%
1 815aa, >SOL_1_50_cov_3.5_N_190968... at 100.00%
2 13323aa, >SOL_1_40_cov_79.5_N_6768... *
3 395aa, >SOL_1_3_cov_5.5_N_257377... at 90.38%
>Cluster 1
0 1759aa, >SOL_1_50_cov_5.5_N_75037... at 100.00%
1 1055aa, >SOL_1_50_cov_4.5_N_129969... at 99.91%
The desired output is the following:
{'Cluster 0': {'samples': {'SOL_1_50': 7, 'SOL_1_40': 79.5, 'SOL_1_3': 5.5, 'SOL_1_10': 0}}, 'Cluster 1': {'samples': {'SOL_1_50': 10, 'SOL_1_40': 0, 'SOL_1_3': 0, 'SOL_1_10': 0}}}
Here is my script:
f_in = 'real_short_test_cluster.txt'
f_out = 'output.txt'
if __name__ == '__main__':
    with open(f_in, 'r') as f:
        lines = f.readlines()
        f.close()
    dct_cluster_sol = dict()
    current_cluster = ''
    nested_dic = {'SOL 1_3','SOL_1_40','SOL_1_10','SOL_1_50'}
    #all_keys = []
    #coverage_count = 0
    for line in lines:
        if "Cluster" in line.strip():
            current_cluster = line.strip().split('>')[1]
            dct_cluster_sol[current_cluster] = dict()
            print('perro')
            print(dct_cluster_sol)
        elif ">SOL_" in line.strip():
            id = line.strip().split('\t')[1].split('>')[1].split('_cov')[0]
            coverage = line.strip().split('\t')[1].split('>')[1].split('_')[4]
            print(coverage,round(float(coverage) + 2.0,6))
            dct_cluster_sol[current_cluster]['samples'] = nested_dic
            print(dct_cluster_sol)
            for i in dct_cluster_sol:
                print(i)
                for j in dct_cluster_sol[i]:
                    for k in dct_cluster_sol[i][j]:
                        print(k)
                        if k == id:
                            print(k)
                            covi = 0.0
                            covi = covi + float(coverage)
                            dct_cluster_sol[i][j][k] = float(covi)
And this is the error I obtain:
Traceback (most recent call last):
File "biostars.py", line 39, in <module>
dct_cluster_sol[i][j][k] = float(covi)
TypeError: 'set' object does not support item assignment
Thank you!
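For what it's worth, the TypeError seems to come from how nested_dic is written: braces that contain bare values with no key: value pairs create a set, not a dict, and a set cannot be assigned to by key. A minimal sketch of the difference, reusing the sample names from the question (the variable names and the use of dict.fromkeys are illustrative choices, not part of the original script):

```python
sample_names = ['SOL_1_3', 'SOL_1_40', 'SOL_1_10', 'SOL_1_50']

# Braces with bare values build a set, which has no keys to assign to.
as_set = {'SOL_1_3', 'SOL_1_40', 'SOL_1_10', 'SOL_1_50'}
try:
    as_set['SOL_1_50'] = 3.5
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# A dict maps each sample name to a value, so item assignment works.
as_dict = dict.fromkeys(sample_names, 0)
as_dict['SOL_1_50'] += 3.5
print(as_dict)
```

Two further things to watch in the script: the same nested_dic object is assigned to every cluster, so even as a dict its counts would be shared between clusters unless each cluster gets its own copy, and one name is spelled 'SOL 1_3' with a space, which would never match the id SOL_1_3 parsed from the file.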
It looks like you are trying to parse CD-HIT output. Maybe you prefer to do it on your own, but there are already scripts to do that. I recommend ParseCDHIT.py in this collection of tools:
https://github.com/jrjhealey/bioinfo-tools
Searching GitHub for "parse cdhit" will produce many other results, but I linked the one I know to work.
Hi Mensur! Yes, that's exactly what I am trying to do. Do you know where I can find an example of the output format for the script? Thank you so much!
Not sure what you are asking here. Is it about the parsing script I recommended? If so, it is easy enough for you to run it and find out, as it has minimal outside dependencies. The output is not exactly what you want, but it should be relatively easy to tailor the original script.
A few lines from the output:
It also creates many fasta files containing sequences from each cluster.
Not sure where exactly your error is, but what you are trying to output is essentially JSON, so you can probably use json.dumps() instead and save yourself a headache.
For any tool you plan to publish, or any script that will not be a one-off, consider using Pydantic schemas, or dataclasses in conjunction with Pydantic, to validate complex structures, serialize values, and write clean output.
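To illustrate the json.dumps() suggestion, here is a small sketch that serialises a nested dict in the shape described in the question and reads it back. The values are copied from the desired output in the post; the variable names are only for illustration.

```python
import json

# Nested dict in the shape described in the question (values from the post).
clusters = {
    'Cluster 0': {'samples': {'SOL_1_50': 7.0, 'SOL_1_40': 79.5, 'SOL_1_3': 5.5, 'SOL_1_10': 0}},
    'Cluster 1': {'samples': {'SOL_1_50': 10.0, 'SOL_1_40': 0, 'SOL_1_3': 0, 'SOL_1_10': 0}},
}

# Serialise to a JSON string; indent only affects readability.
text = json.dumps(clusters, indent=2)
print(text)

# Round trip back to a dict to confirm nothing is lost.
restored = json.loads(text)
```

To write straight to a file instead of building a string, json.dump(clusters, handle) takes an open file handle as its second argument.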
Is there a particular reason you need it in a nested dictionary? I think you are probably making your life unnecessarily hard by trying to do arithmetic over multiple entries and then concoct a dict format for it.
It's also not clear where 'SOL_1_10':0 is coming from, as it isn't represented in the clusters anywhere. Is all missing data to be treated as 0 coverage?
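If the nested dictionary is really what is needed, one possible approach is sketched below. It rests on several assumptions: the full list of sample names is known up front, samples missing from a cluster count as 0 (as the desired output suggests), and every member line contains a header of the form >SAMPLE_cov_NUMBER_. The names parse_clusters, SAMPLES and MEMBER are hypothetical, not taken from the original script or from any CD-HIT tool.

```python
import re

# Assumption: every sample of interest is listed up front so that
# samples absent from a cluster are reported with a coverage of 0.
SAMPLES = ['SOL_1_3', 'SOL_1_40', 'SOL_1_10', 'SOL_1_50']

# Captures sample name and coverage from e.g. ">SOL_1_50_cov_3.5_N_171282..."
MEMBER = re.compile(r'>(?P<sample>\S+?)_cov_(?P<cov>\d+(?:\.\d+)?)_')


def parse_clusters(lines, samples=SAMPLES):
    """Sum the coverage per sample for each cluster in .clstr-style lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith('>Cluster'):
            current = line[1:]
            # Fresh dict per cluster so counts are not shared between clusters.
            clusters[current] = {'samples': dict.fromkeys(samples, 0)}
            continue
        match = MEMBER.search(line)
        if match and current is not None:
            counts = clusters[current]['samples']
            name = match.group('sample')
            counts[name] = counts.get(name, 0) + float(match.group('cov'))
    return clusters


# The example input from the question.
example = """\
>Cluster 0
0 948aa, >SOL_1_50_cov_3.5_N_171282... at 100.00%
1 815aa, >SOL_1_50_cov_3.5_N_190968... at 100.00%
2 13323aa, >SOL_1_40_cov_79.5_N_6768... *
3 395aa, >SOL_1_3_cov_5.5_N_257377... at 90.38%
>Cluster 1
0 1759aa, >SOL_1_50_cov_5.5_N_75037... at 100.00%
1 1055aa, >SOL_1_50_cov_4.5_N_129969... at 99.91%
"""
result = parse_clusters(example.splitlines())
print(result)
```

Using a regular expression rather than chained split() calls means the parsing does not depend on whether the columns are separated by tabs or spaces. For a real file, passing the open file handle to parse_clusters should work the same way, since it only iterates over lines.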