Hi everyone,
I need some advice: I am trying to match the residue numbers from a PDB with an alignment (from multiple sequences alignment - MSA, from PROSITE). But there are some difficulties :
- The protein sequence in the MSA can be smaller than the PDB sequence
- The PDB sequence can be a little different because of mutation (sometimes needed for crystallization)
- I don't want to realign the two sequences because the alignment from PROSITE has to stay untouched.
Here's a small example :
pdbSequence = "MTGLKILYH"
alignment = "GPKI---LYH"
pos = get_alignment_position(pdbSequence,alignment)
print(pos)
[-2,-1,0,1,2,3,7,8,9]
Here the pos
start at '-2' because MT
is not included in the alignment from PROSITE, and even if you have a mutation at pos[1]
it still math (GL
instead of GP
).
Here's a full example :
pdbSequence = "QSSGSSGYGSEKKGYLLKKSDGIRKVWQRRKCSVKNGILTISHATSNRQPAKLNLLTCQVKPNAEDKKSFDLISHNRTYHFQAEDEQDYVAWISVLTNSKEEALTMAFSGPSSG"
alignment = "GSEKKGYLLKKSDG.......................IRKVWQRRKCSVKN..................GILTISHAT....................................................................................................................................................................................----SNRQPAKLNLLTCQVKPNAE----..................................................................................................................................DKKSFDLISHNR-.........................-TYHFQAEDE.............QDYVAWISVLTNSKE"
get_alignment_position(pdbSequence, alignment)
[-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 69, 70, 71, .........]
For now, I managed to do something with biopython (to read the alignment) and difflib for matching my two sequences. but there is so much 'special case' to deal with that I am afraid to miss something and have bad data... SO if you know a method or an algorithm I take it =D
Best regards,
PS: Here's my function for now (adapted from my code):
def get_sequence_position(sequence, alignement):
alignment = str(list(recordDict.values())[0].seq)
matcher = difflib.SequenceMatcher(a=sequence, b=alignment)
match = matcher.find_longest_match(0, len(matcher.a), 0, len(matcher.b))
d = difflib.Differ()
diff = list(d.compare(alignment, sequence))
seqInAliSize = len(alignment.replace(" ",'').replace(".",''))
position = []
#first search first match. it will be index 0!
for i in range(len(diff)):
if diff[i][0] == ' ':
indexbase = -i
break
mutationFound = False
nbMutation = 0
for i in range(len(diff)):
pos=indexbase+i-nbMutation
match = diff[i]
if mutationFound :
mutationFound =False #reset the flag
continue
if match[0] == '+': #not in the alignement
position.append(pos)
elif match[0] == ' ': #MATCH
if len(position)==0: #if it is the first match, we want to be SURE that the beginin is not a "false" beginin.
#we can have 5% of chance that the first match is a false match (1/20) but 0.0125% that the 3 first match are wrong.
if diff[i+1][0] == ' ' and diff[i+2][0] == ' ' and diff[i+3][0] == ' ':
position.append(pos)
else:
continue
else:
position.append(pos)
elif match[0] == '-' :#not in the sequence. DEAL WITH MUTATIONS!
if match[2] not in ['-','.']:
#position.append(pos)
position.append(pos)
mutationFound=True
nbMutation+=1
return position
Doesn't Jalview do this through its GUI? perhaps you don't want a manual procedure...
To do it on Jalview I think I have to align the two sequences first, but if I do so, my original alignment will be changed, and Indeed I have hundred of sequence to match so it can be long with Jalview :-)
The Idea behind is to be able to select a loop on a protein and then apply this selection on every protein of its family (eg : I know that loop 2 is residue 20 to 30, which correspond to alignment positions 30 to 40 because of some gaps, so then I will select all amino acids between 30 and 40 in the family alignment)
Hi, just for the sake of clarification:
The output in alignment is the result of the ScanProsite tool, correct?
In that case the output in FASTA format looks like this for a protein similar to yours, giving its accession:
As you can see, the fasta header contains the matching position (324-416) in the query sequence, so the task of locating the motif start should be rather straight-forward if you have retained the output data.
Hei :-)
I download the Fasta from PROSITE for the whole family and then I match on the Uniprot GeneName. The starting position is also in the FASTA header, but I do not want to localize the position of the amino acid in the protein, but in the alignment.
Let's take for example the PDB 2DA0 (the PDB corresponding to the sequence given in the example): I want to make statistics on amino acids of HELIX2 let's say, which correspond to WQRRKCSV (26-33 in the sequence). But to make statistics, I need a lot of data, So I will use every structure that I can get, and I will enrich this with sequences without structures (given by the PROSITE family alignment). But to make sure that I take to right amino acids, I need a multiple sequence alignment in the family. So from this MSA, the corresponding residues are 41-48. Since every amino acid will have an index based on the alignment, I can select all amino acids in the alignment.
It is the Alignment based indexing that I try to generate. But the problem here is that I take the sequence from the PDB file (I take all Carbon Alphas and keep the residue name), there are some modifications in the amino acids. so the matching is not an easy task...