Question

How to count the number of shared (the same) nucleotides between any two sequences in one alignment

0

Entering edit mode

10.5 years ago

YMTIO • 0

Hi everyone,

I want to count the shared (the same) nucleotides between any two sequences in one alignment. I want to get a matrix including numbers of nucleotides shared by any two sequences in a batch way. However, I can not program. Anyone can help me to solve it? Thanks!

sequence-alignment • 4.8k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by YMTIO • 0

0

Entering edit mode

Thanks! but I want to get a matrix including the shared number between all pairwise in a batch way.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by YMTIO • 0

0

Entering edit mode

Alignment tools like mafft, muscle etc. does this for you if you are not comfortable with writing a script to compare all possible k-mers/aligned bases between all pairwise sequences to construct a distance matrix.

Follow the steps given in this link: http://mafft.cbrc.jp/alignment/software/treeout.html

(keep in mind that your input should be the raw fasta sequences)

Other wise you could use (slight modification required) this python code to do all possible pairwise comparisons in your alignment file to get the distance matrix:

import numpy as np
import matplotlib.pylab as plt

def get_difference(x,y):
    return sum(ele_x != ele_y for ele_x, ele_y in zip(x,y))

my_list = []

f3= open("GBSraw_6lines.txt", "r")
lines3 = f3.readlines()
for line in lines3:
                line=line.split('\t')
                #line=line.split('\t')

                if not line[0].startswith("Taxa"):
                                #print line [1:-1]
                                #print(''.join(line[1:-1]))
                                #print line
                                my_list.append(''.join(line[1:-1]))
#print my_list

n = len(my_list)
l= len(my_list[0])

my_array = np.zeros((n,n))
#
for i, ele_1 in enumerate(my_list):
    for j, ele_2 in enumerate(my_list):
        if j >= i:
            break # Since the matrix is symmetrical we don't need to
                  # calculate everything
        #difference = 1.0-((int(l))-(int(get_difference(ele_1, ele_2)))/int(l))
        difference = 1-((l-get_difference(ele_1, ele_2))/float(l))
        print difference
        #print (int(n))
        my_array[i, j] = difference
        my_array[j, i] = difference

line_names = ['','B73','CML277','HP301','Mo17','P39','Tx303']

#line_names2 = ['','Tx303','P39','Mo17','HP301','CML277','B73'] 
print my_array

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.set_aspect('equal')
plt.imshow(my_array, interpolation='nearest', cmap=plt.cm.ocean)

for y in range(my_array.shape[0]):
    for x in range(my_array.shape[1]):
        plt.text(x , y , '%.4f' % my_array[y, x],
                 horizontalalignment='center',
                 verticalalignment='center',
                 )

plt.colorbar()
ax.xaxis.set_ticks_position('top')
ax.set_xticklabels(line_names, minor=False)
ax.set_yticklabels(line_names, minor=False)
plt.show()

f3.close()

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Felix Francis ▴ 600

0

Entering edit mode

HI Felix

Thanks for your reply. However, i do not want to get the distance matrix. I just want to get the shared number matrix (numbers of nucleotides shared by any two sequences) similar to identity matrix.

ADD REPLY • link 10.5 years ago by YMTIO • 0

score 1 · Answer 1 · 2015-03-31

1

Entering edit mode

10.5 years ago

Felix Francis ▴ 600

Run this python code :

https://github.com/ffrancis/bioinformatic_algorithms/blob/master/codes/1_11_HammingDistance_pattern_match.py

You need to have the inputs p and q as each of the sequences in your algnment

ADD COMMENT • link 10.5 years ago by Felix Francis ▴ 600

score 0 · Answer 2 · 2015-03-31

0

Entering edit mode

10.5 years ago

Emily 24k

Clustal omega

ADD COMMENT • link 10.5 years ago by Emily 24k