Hi everyone,
I have several fasta protein alignments (~5000) and I would want to identify polymorphic positions plus the aminoacid residues that change between sequences. I have tried to write a code myself, but it has been very difficult (I'm new in programming), and I looked on BioPython but I hasn't found anything yet. I want something like:
Protein alignment:
> sp1 MQGAAYMQAAAYYMQA
> sp2 MQGAARMQGAAYYMQA
> sp3 MQGAARMQGAAYYMQM
> sp4 MQGAARMQGAAYYMQA
> sp5 MQGAARMQAAAYYMQA
^ ^ ^
In the example above, the alignment has 3 polymorphic positions (marked with ^). The first one is located on the 6th position, the second one has the 9th position and the third one the 16th position. A common notation of a polymorphic site coult be as follows: R6Y, which means that a change occured in the 6th position from an R to a Y. The direction of the change (R->Y or Y->R) is based on the most frequent letter at that position. So, in this case R has the highest frequency and one can infer that the direction is R->Y.
As you can see, the 6th and 16th positions have single changes (the different letter is at frequency of 1). However, the 9th position has two sequences (sp1 and sp5) with the change. I would like two distinguish between this two types of polymorphisms. Thus, in this case I woul like an output something like this:
Output :
# Alignment #1
# Single polymorphisms:
# R6Y: sp1
# A16M: sp3
# Non-single polymorphisms:
# G9A: sp1, sp5
Any suggestion is very appreciated, thanks!!
Hi @Rodrigo,
Thank you very much! This is incredibly useful. I just had to added
.decode('UTF-8')
when concatenatingmain_letter
andletter
as follows:change = main_letter.decode('UTF-8') + str(column + 1) + letter.decode('UTF-8')
, because otherwise the script returns this error message: TypeError: can't concat numpy.bytes_ to str.best