At this tutorial link for Biopython Cookbook, I saw this:
Next, we read in the 16S ribosomal RNA gene sequence of Escherichia coli and Bacillus subtilis (provided in Tests/scoring_matrices/ecoli.fa and Tests/scoring_matrices/bsubtilis.fa), and align them to each other:
>>> from Bio import SeqIO >>> sequence1 = SeqIO.read('ecoli.fa', 'fasta') >>> sequence2 = SeqIO.read('bsubtilis.fa', 'fasta') >>> alignments = aligner.align(sequence1.seq, sequence2.seq)
The number of alignments generated is very large:
>>> len(alignments) 1990656
However, as they only differ trivially from each other, we arbitrarily choose the first alignment, and count the number of each substitution:
>>> alignment = alignments[0] >>> from Bio.Align.substitution_matrices import Array >>> frequency = Array("ACGT", dims=2) >>> for (start1, end1), (start2, end2) in zip(*alignment.aligned): ... seq1 = sequence1[start1:end1] ... seq2 = sequence2[start2:end2] ... for c1, c2 in zip(seq1, seq2): ... frequency[c1, c2] += 1 ... >>> print(frequency) A C G T A 307.0 19.0 34.0 19.0 C 15.0 280.0 25.0 29.0 G 34.0 24.0 401.0 20.0 T 24.0 36.0 20.0 228.0 <BLANKLINE>
We normalize against the total number to find the probability of each substitution, and create a symmetric matrix:
>>> import numpy >>> probabilities = frequency / numpy.sum(frequency) >>> probabilities = (probabilities + probabilities.transpose()) / 2.0 >>> print(probabilities.format("%.4f")) A C G T A 0.2026 0.0112 0.0224 0.0142 C 0.0112 0.1848 0.0162 0.0215 G 0.0224 0.0162 0.2647 0.0132 T 0.0142 0.0215 0.0132 0.1505 <BLANKLINE>
The background probability is the probability of finding an A, C, G, or T nucleotide in each sequence separately. This can be calculated as the sum of each row or column:
>>> background = numpy.sum(probabilities, 0) >>> print(background.format("%.4f")) A 0.2505 C 0.2337 G 0.3165 T 0.1993 <BLANKLINE>
The number of substitutions expected at random is simply the product of the background distribution with itself:
>>> expected = numpy.dot(background[:,None], background[None, :]) >>> print(expected.format("%.4f")) A C G T A 0.0627 0.0585 0.0793 0.0499 C 0.0585 0.0546 0.0740 0.0466 G 0.0793 0.0740 0.1002 0.0631 T 0.0499 0.0466 0.0631 0.0397 <BLANKLINE>
The scoring matrix can then be calculated as the logarithm of the odds-ratio of the observed and the expected probabilities:
>>> oddsratios = probabilities / expected >>> scoring_matrix = numpy.log2(oddsratios) >>> print(scoring_matrix) A C G T A 1.7 -2.4 -1.8 -1.8 C -2.4 1.8 -2.2 -1.1 G -1.8 -2.2 1.4 -2.3 T -1.8 -1.1 -2.3 1.9 <BLANKLINE>
The matrix can be used to set the substitution matrix for the pairwise aligner:
>>> aligner.substitution_matrix = scoring_matrix
A ValueError is triggered if the Array objects appearing in a mathematical operation have different alphabets:
>>> from Bio.Align.substitution_matrices import Array >>> d = Array("ACGT") >>> r = Array("ACGU") >>> d + r Traceback (most recent call last): ... ValueError: alphabets are inconsistent
when and why would I want to make my own substitution matrix rather than use one that is premade?
You might want to use a highly organism-specific substitution matrix. The commonly available ones are based on model organisms and existing databases, which means they may not be optimal for a specific organism, especially if it is novel. Not all organisms evolve in exactly the same way.
How often people working on such organisms have good ground truth data about its substitution patterns though I am not sure (I suspect rarely).