Question

Multiple alignment processing and Visualization tool

1

Entering edit mode

10.3 years ago

ishengomae ▴ 110

I am looking for tools that would help process multiple alignment of protein sequences and extract polymorphic sites only--based on reference sequence. Well, what I have in mind is an output of something matrix-like like so:

Position             2    10     20    45     60     63
RefSeq               Val  Leu    Ala   Ser    Phe    Thr 
Seq1                 Met               Tyr
Seq2                             Gly                Pro
Seq3                      Ala                 Arg

Any suggestion for tools or scripts(preferably python) that can help me achieve this fast?

Thanks.

alignment • 2.3k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by ishengomae ▴ 110

Ram · Accepted Answer · 2014-08-08

2

Entering edit mode

10.3 years ago

Cytosine ▴ 460

This looks like a task you can write up a script by yourself before finding a tool to do the job for you.

Assuming you have sequences saved in a python list and the reference is the last one, here's something you could do...

seqs = do_your_magic_to_get_the_sequences()

ref = seqs[-1]
diffs_top, diffs = [], []
all_diffs = set()

for s in seqs[:-1]:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != ref[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)

for d in all_diffs:
    print "%-3d" % int(d+1),
print

for c in ref:
    print "%-3s" % c,
print

for i, s in enumerate(seqs[:-1]):
    for j, c in enumerate(s):
        if j in diffs_top[i]:
            print "%-3s" % c,
        else:
            print "%-3s" % " ",
    print

The print out of an example data would then be like:

1   2   3   4   5   6   7   8   9   10  11  12  13  14
M   L   Y   M   V   A   R   V   A   Y   K   K   N   P
V   P   R           F   M           G           L
    A   B   K   L   P   P   K       B               C
N       P       Y       F       C       C   L       C

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by Cytosine ▴ 460

0

Entering edit mode

Excellent and elegant, thanks very much.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by ishengomae ▴ 110

0

Entering edit mode

Hello @Cytosine, more help please.

I implemented this code with a snippet of my data and I think I am just a small step away from achieving my goal.

reference_string = 'MAHEWGPQRLAGGQPQAS'
string1 = 'MAQQWSLQRLAGRHPQDS'
string2 = 'MAQRWGAHRLTGGQLQDT'
string3 = 'MAQRWGPHALSGVQAQDA'

string_list = [string1, string2, string3]

reference = reference_string
diffs_top, diffs = [], []
all_diffs = set()

for s in string_list:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != reference[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)

for d in all_diffs:
    print str(int(d+1)),
print

for c in reference:
    print str(c),
print

for i, s in enumerate(string_list):
    for j, c in enumerate(s):
        if j in diffs_top[i]:
            print str(c),
        else:
            print str(' '),
    print

This would give me:

3 4 6 7 8 9 11 13 14 15 17 18
M A H E W G P Q R L A G G Q P Q A S
    Q Q   S L           R H     D  
    Q R     A H     T       L   D T
    Q R       H A   S   V   A   D A

The positional information at the top is crucial and I only want to extract corresponding reference residues and those aligned to those positions, not the entire length of the alignment. I tried to figure how to modify this code to do that, but so far unsuccessfully.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by ishengomae ▴ 110

0

Entering edit mode

@Cytosine,

Finally I have found a solution. You helped me so much that I can't help sharing my solution with you -- which is just a small modification of your code. Again, thanks so much. This code does the job.

reference_string = 'MAHEWGPQRLAGGQPQAS'
string1 = 'MAQQWSLQRLAGRHPQDS'
string2 = 'MAQRWGAHRLTGGQLQDT'
string3 = 'MAQRWGPHALSGVQAQDA'

string_list = [string1, string2, string3]

reference = reference_string
diffs_top, diffs = [], []
residue_top_dict, residue_little_dict = {}, {}
all_diffs = set()

for n, s in enumerate(string_list):
    residue_little_dict = {}
    for i, c in enumerate(s):
        residue_little_dict[i] = c
    residue_top_dict[n] = residue_little_dict

for s in string_list:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != reference[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)

for d in all_diffs:
    print str(int(d+1)),
print

for d in all_diffs:
    for i, c in enumerate(reference):
        if i == d:
            print c,
print

reference_dict = {}
for d in all_diffs:
    for i, c in enumerate(reference):
        if i == d:
            reference_dict[i] = c

for k, v in residue_top_dict.iteritems():
    for m, n in v.iteritems():
        for key, value in reference_dict.iteritems():
            if m == key and n != value:
                print n,
            elif m == key and n == value:
                print '*',
            else:
                pass
    print

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by ishengomae ▴ 110