13 months ago
star
I have a protein alignment data table like the one below. I would like to know how to calculate, for each amino acid position, the number of differences between query 1 and the other queries.
For example: the protein sequence starts with "ME" and finishes with "HL". At position 5 there is a difference: query 3 has "M", whereas query 1 has "V". I would then expect a data frame like:
df <- data.frame(difference=c(0,0,0,0,1,........))
Input:
query amino_acids
1 lcl|Query_10001 MEKIVLLFAIVSLVKSDQICIGYHANNSTEQVDTIMEKNVTVTHAQDILEKKHNGKLCDL
2 lcl|Query_10002 MEKIVLLLSVVSLVKSDQICIGYHANNSTEQVDTIMEKNVTVTHAQDILEKTHNGKLCDL
3 lcl|Query_10003 MEKIMLLLAATGLVKSDHICIGYHANNSTKQVDTIMEKNVTVTHAQDILEKTHNGKLCDL
4
5 lcl|Query_10001 DGVKPLILRDCSVAGWLLGNPMCDEFINVPEWSYIVEKANPVNDLCYPGDFNDYEELKHL
6 lcl|Query_10002 NGVKPLILKDCSVAGWLLGNPMCDEFISVPEWSYIVERANPANDLCYPGNLNDYEELKHL
7 lcl|Query_10003 NGVKPLILKDCSVAGWLLGNPMCDEFINVPEWSYIVEKANPANGLCYPGSFNDYEELKHL
Thank you in advance for any help!
You're looking for residue-level conservation scores; what you've posted here is an XY problem. Unless you need to use R to do this, there are better tools out there.
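If you do need to stay in R, here is a minimal base-R sketch of the count asked for in the question. It is not a conservation score, just a per-position mismatch count against query 1. The names `aln`, `full`, `residues`, `ref` and `mismatch` are made up for illustration, and it assumes the blank separator row has been dropped, that the blocks of each query appear in order, and that all joined sequences have the same length. Gap characters, if any, would be counted as differences.

```r
# Alignment table as posted in the question (blank separator row omitted).
aln <- data.frame(
  query = rep(c("lcl|Query_10001", "lcl|Query_10002", "lcl|Query_10003"), times = 2),
  amino_acids = c(
    "MEKIVLLFAIVSLVKSDQICIGYHANNSTEQVDTIMEKNVTVTHAQDILEKKHNGKLCDL",
    "MEKIVLLLSVVSLVKSDQICIGYHANNSTEQVDTIMEKNVTVTHAQDILEKTHNGKLCDL",
    "MEKIMLLLAATGLVKSDHICIGYHANNSTKQVDTIMEKNVTVTHAQDILEKTHNGKLCDL",
    "DGVKPLILRDCSVAGWLLGNPMCDEFINVPEWSYIVEKANPVNDLCYPGDFNDYEELKHL",
    "NGVKPLILKDCSVAGWLLGNPMCDEFISVPEWSYIVERANPANDLCYPGNLNDYEELKHL",
    "NGVKPLILKDCSVAGWLLGNPMCDEFINVPEWSYIVEKANPANGLCYPGSFNDYEELKHL"
  ),
  stringsAsFactors = FALSE
)

# Join the alignment blocks of each query into one full-length sequence.
full <- sapply(split(aln$amino_acids, aln$query), paste, collapse = "")
stopifnot(length(unique(nchar(full))) == 1)  # aligned sequences must match in length

# One row per query, one column per alignment position.
residues <- do.call(rbind, strsplit(full, ""))

# Compare every other query against the reference (query 1), position by position.
ref <- "lcl|Query_10001"
others <- setdiff(rownames(residues), ref)
mismatch <- t(residues[others, , drop = FALSE]) != residues[ref, ]

# Number of queries that differ from the reference at each position.
df <- data.frame(
  position   = seq_len(ncol(residues)),
  difference = rowSums(mismatch)
)
head(df)
```

With more than two other queries, `difference` can go above 1 because it counts how many queries disagree with query 1 at that position. If you only want a 0/1 flag per position, `as.integer(df$difference > 0)` would do.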