I'm currently trying to write a program that will calculate Mutual Information given text files of nucleotide distributions.
Below is the MI program, as seen I am currently stuck after finding marginal distribution, then dividing the values by the number of bins (8) in my Sort-Seq experiment:
import pandas as pd
import numpy as np
import sys
filename = sys.argv[1]
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
df = pd.read_csv(filename, header=0, sep= ',', skipinitialspace=True)
headers = ['A', 'T', 'G', 'C']
df.columns = headers
df.head(499)
df['max'] = df[['A', 'T', 'G', 'C']].max(axis=1)
df['sum'] = df[['A', 'T', 'G', 'C']].sum(axis=1)
df.loc[:,"A":"C"] = df.loc[:,"A":"C"].div(df["sum"], axis=0)
df['mutation_rate'] = (1-df['max']/df['sum'])
df['max2'] = df[['A', 'T', 'G', 'C']].max(axis=1)
df['sum2'] = df[['A', 'T', 'G', 'C']].sum(axis=1)
df['marginal_distribution']=(1-df['max2']/df['sum2'])
df.head()
#df.head()
#df['A/8'] = df['A'].div(8)
#df['T/8'] = df['T'].div(8)
#df['G/8'] = df['G'].div(8)
#df['C/8'] = df['G '].div(8)
#df.head() --> **8 bins of sequences**
sys.stdout = open("stats_" + filename, "w")
print(df)
sys.stdout.close()
With the output
A T G C
0 0.000671 0.000471 0.00028 0.998578
1 0.000591 0.000319 0.000048 0.999042
2 0.999169 0.000351 0.000192 0.000288
3 0.000024 0.000351 0.000032 0.999593
4 0.999297 0.000184 0.000224 0.000296
5 0.00004 0.000184 0.000032 0.999744
6 0.999497 0.000064 0.000144 0.000295
7 0.999329 0.000256 0.000112 0.000303
8 0.000072 0.0002 0.000064 0.999665
max sum mutation_rate
125032 125210 0.001422
125082 125202 0.000958
125107 125211 0.000831
125161 125212 0.000407
125122 125210 0.000703
125180 125212 0.000256
125149 125212 0.000503
125124 125208 0.000671
125170 125212 0.000335
max2 sum2
0.998578 1
0.999042 1
0.999169 1
0.999593 1
0.999297 1
0.999744 1
0.999497 1
0.999329 1
0.999665 1
marginal_distribution
0.001422
0.000958
0.000831
0.000407
0.000703
0.000256
0.000503
0.000671
0.000335
A/numberOfBins T/numberOfBins G/numberOfBins C/numberOfBins
0.000084 0.000059 0.000035 0.124822
0.000074 0.00004 0.000006 0.12488
0.124896 0.000044 0.000024 0.000036
0.000003 0.000044 0.000004 0.124949
0.124912 0.000023 0.000028 0.000037
0.000005 0.000023 0.000004 0.124968
0.124937 0.000008 0.000018 0.000037
0.124916 0.000032 0.000014 0.000038
0.000009 0.000025 0.000008 0.124958
As of right now, each bin is printed to its own output file as seen above. I will need to get specific values from each nucleotide at each specified position across the 8 bins I am using to properly carry out the calculation.
For instance, in the example output file above, 0.000084 is represented at position 1 for A.
When completing this in Excel I use the following formula. Where H294 = 0.000084. The other values in row 294 are the other bins at this exact location.
=H294*LOG(H294/(0.125*SUM(H294,N294,T294,Z294,AF294,AL294,AR294,AX294)),2)
How should I set up the input file? Can this be done in df / called with one line? What line of code do I need here?