Multiple alignment processing and Visualization tool
1
1
Entering edit mode
10.3 years ago
ishengomae ▴ 110

I am looking for tools that would help process multiple alignment of protein sequences and extract polymorphic sites only--based on reference sequence. Well, what I have in mind is an output of something matrix-like like so:

Position             2    10     20    45     60     63
RefSeq               Val  Leu    Ala   Ser    Phe    Thr 
Seq1                 Met               Tyr
Seq2                             Gly                Pro
Seq3                      Ala                 Arg

Any suggestion for tools or scripts(preferably python) that can help me achieve this fast?

Thanks.

alignment • 2.3k views
ADD COMMENT
2
Entering edit mode
10.3 years ago
Cytosine ▴ 460

This looks like a task you can write up a script by yourself before finding a tool to do the job for you.

Assuming you have sequences saved in a python list and the reference is the last one, here's something you could do...

seqs = do_your_magic_to_get_the_sequences()

ref = seqs[-1]
diffs_top, diffs = [], []
all_diffs = set()

for s in seqs[:-1]:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != ref[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)

for d in all_diffs:
    print "%-3d" % int(d+1),
print

for c in ref:
    print "%-3s" % c,
print

for i, s in enumerate(seqs[:-1]):
    for j, c in enumerate(s):
        if j in diffs_top[i]:
            print "%-3s" % c,
        else:
            print "%-3s" % " ",
    print

The print out of an example data would then be like:

1   2   3   4   5   6   7   8   9   10  11  12  13  14
M   L   Y   M   V   A   R   V   A   Y   K   K   N   P
V   P   R           F   M           G           L
    A   B   K   L   P   P   K       B               C
N       P       Y       F       C       C   L       C
ADD COMMENT
0
Entering edit mode

Excellent and elegant, thanks very much.

ADD REPLY
0
Entering edit mode

Hello @Cytosine, more help please.

I implemented this code with a snippet of my data and I think I am just a small step away from achieving my goal.

reference_string = 'MAHEWGPQRLAGGQPQAS'
string1 = 'MAQQWSLQRLAGRHPQDS'
string2 = 'MAQRWGAHRLTGGQLQDT'
string3 = 'MAQRWGPHALSGVQAQDA'

string_list = [string1, string2, string3]

reference = reference_string
diffs_top, diffs = [], []
all_diffs = set()

for s in string_list:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != reference[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)

for d in all_diffs:
    print str(int(d+1)),
print

for c in reference:
    print str(c),
print

for i, s in enumerate(string_list):
    for j, c in enumerate(s):
        if j in diffs_top[i]:
            print str(c),
        else:
            print str(' '),
    print

This would give me:

3 4 6 7 8 9 11 13 14 15 17 18
M A H E W G P Q R L A G G Q P Q A S
    Q Q   S L           R H     D  
    Q R     A H     T       L   D T
    Q R       H A   S   V   A   D A

The positional information at the top is crucial and I only want to extract corresponding reference residues and those aligned to those positions, not the entire length of the alignment. I tried to figure how to modify this code to do that, but so far unsuccessfully.

ADD REPLY
0
Entering edit mode

@Cytosine,

Finally I have found a solution. You helped me so much that I can't help sharing my solution with you -- which is just a small modification of your code. Again, thanks so much. This code does the job.

reference_string = 'MAHEWGPQRLAGGQPQAS'
string1 = 'MAQQWSLQRLAGRHPQDS'
string2 = 'MAQRWGAHRLTGGQLQDT'
string3 = 'MAQRWGPHALSGVQAQDA'

string_list = [string1, string2, string3]

reference = reference_string
diffs_top, diffs = [], []
residue_top_dict, residue_little_dict = {}, {}
all_diffs = set()

for n, s in enumerate(string_list):
    residue_little_dict = {}
    for i, c in enumerate(s):
        residue_little_dict[i] = c
    residue_top_dict[n] = residue_little_dict

for s in string_list:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != reference[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)

for d in all_diffs:
    print str(int(d+1)),
print

for d in all_diffs:
    for i, c in enumerate(reference):
        if i == d:
            print c,
print

reference_dict = {}
for d in all_diffs:
    for i, c in enumerate(reference):
        if i == d:
            reference_dict[i] = c

for k, v in residue_top_dict.iteritems():
    for m, n in v.iteritems():
        for key, value in reference_dict.iteritems():
            if m == key and n != value:
                print n,
            elif m == key and n == value:
                print '*',
            else:
                pass
    print
ADD REPLY

Login before adding your answer.

Traffic: 2683 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6