Hello BioStar Community,
I have a CLUSTAL alignment file with 500 protein sequences, formatted like this:
S1 DERY .....
S2 RKH .... ....
S500 HERKKK ....
where S1, S2, ..., S500 are UniProt codes.
However, each of the 500 entries appears multiple times, because each line is limited to n characters and the alignment is therefore split into blocks. I need to join the pieces of each entry so that every sequence is stored as a single string. Doing this for a single sequence is straightforward: I first put all UniProt codes into a list so that I can identify each protein sequence, and then I use a snippet like
    templateSeq = ''
    for line in alignment:
        if line.startswith('3NY8A'):
            line = line.lstrip('3NY8A')
            line = line.rstrip('\n')
            line = line.rstrip('\t')
            templateSeq += line
    # remove any remaining whitespace between the joined pieces
    templateSeq = ''.join(templateSeq.split())
But S1 is only one of the 500 sequences, and I need to automate this procedure for all of them. I think I need to write a function (e.g. sequence_to_string) and then apply it to every line in the alignment file. My implementation is
    seq = ''
    def sequence_to_string():
        ''' make each MSA sequence a string '''
        global seq
        line = line.lstrip(unique_uniprot_id[i])
        line = line.rstrip('\n')
        line = line.rstrip('\t')
        seq += line
        seq = ''.join(seq.split())
        return seq
And then loop over the file with:
    for i in range(len(unique_uniprot_id)):
        sequence_to_string()
But this does not return anything. Is there a way to modify my function so that it builds 500 strings (maybe I need to put them in a list?), one per protein sequence? Many thanks in advance!
Regards, Spyros
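One way to get all 500 strings in a single pass is to collect the pieces in a dictionary keyed by the identifier at the start of each line, instead of writing one loop per sequence. The following is only a minimal sketch in plain Python: the function name read_alignment_sequences is made up for illustration, and it assumes that every sequence line has the form "ID fragment" (optionally followed by a residue count), that the header line starts with "CLUSTAL", and that consensus lines start with whitespace.

```python
def read_alignment_sequences(lines):
    """Join the per-block fragments of each entry into one string per ID.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Returns a dict mapping each identifier to its full aligned sequence.
    """
    fragments = {}
    for line in lines:
        # Skip blank lines, consensus lines (they start with whitespace)
        # and the header line (assumed to start with "CLUSTAL").
        if not line.strip() or line[0].isspace() or line.startswith("CLUSTAL"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # not an "ID fragment" line
        seq_id, fragment = fields[0], fields[1]
        # Collect the pieces in file order; dicts keep insertion order.
        fragments.setdefault(seq_id, []).append(fragment)
    return {seq_id: "".join(parts) for seq_id, parts in fragments.items()}


# Small made-up example with two sequences split over two blocks.
example = [
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "S1    DERY",
    "S2    RKH-",
    "      *  .",
    "",
    "S1    AAAA",
    "S2    CCCC",
]
sequences = read_alignment_sequences(example)
print(sequences["S1"])  # DERYAAAA
```

With a real file, passing the open handle (for example inside a with open(...) block) instead of the example list should work the same way, and list(sequences.values()) gives the strings as a list if that is preferred.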
That would be my suggestion as well.
Mine too! BioPython rocks!! ;) Do something like the following to convert from CLUSTAL to FASTA:
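A minimal sketch of that conversion, assuming Biopython is installed and the file starts with a standard CLUSTAL header line. The function names and any file names passed to them are placeholders, not taken from the original post. AlignIO.convert rewrites the file as FASTA, and AlignIO.read gives direct access to each full-length sequence:

```python
def clustal_to_fasta(clustal_path, fasta_path):
    """Convert a CLUSTAL alignment file to FASTA.

    Returns the number of alignments converted (1 for a normal file).
    """
    # Imported here so the sketch can be loaded even where Biopython
    # is not installed; normally this import sits at the top of the file.
    from Bio import AlignIO
    return AlignIO.convert(clustal_path, "clustal", fasta_path, "fasta")


def clustal_to_strings(clustal_path):
    """Return a dict mapping each sequence ID to its full aligned sequence."""
    from Bio import AlignIO
    alignment = AlignIO.read(clustal_path, "clustal")
    # Each record holds one complete sequence; the blocks are already joined.
    return {record.id: str(record.seq) for record in alignment}
```

Calling clustal_to_strings on the alignment file should give one string per UniProt code without any manual line handling.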