I have a protein sequence file looks like this:
>102L:A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX
The first one is the name of the sequence, the second one is the actual protein sequence, and the first one is the indicator that shows if there is any missing coordinates. In this case, notice that there is two "X" in the end. That means that the last two residue of the sequence witch are "NL" in this case are missing coordinates.
By coding in Python I would like to generate a table which should look like this:
- name of the sequence
- total number of missing coordinates (which is the number of X)
- the range of these missing coordinates (which is the range of the position of those X)
- the length of the sequence
- the actual sequence
So the final results should looks like this:
>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
And my code looks like this so far:
total_seq = []
with open('sample.txt') as lines:
for l in lines:
split_list = l.split()
# Assign the list number
header = split_list[0] # 1
seq = split_list[1] # 5
disorder = split_list[2]
# count sequence length and total residue of missing coordinates
sequence_length = len(seq) # 4
for x in disorder:
counts = 0
if x == 'X':
counts = counts + 1
total_seq.append([header, seq, str(counts)]) # obviously I haven't finish coding 2 & 3
with open('new_sample.txt', 'a') as f:
for lol in total_seq:
f.write('\n'.join(lol))
I'm new in python, would anyone help please, thank you so much guys!
Have any of the answers/comments to your previous questions (links above) provided any assistance?
It helped. But for this I still don't understand how to solve number 2 and 3 in my goal. which is the total number of missing coordinates and the range of those missing coordinates.