Question

Calculating mutual information from FASTA files in Python

0

Entering edit mode

4.0 years ago

nameuser ▴ 30

I'm currently trying to write a program that will calculate Mutual Information given text files of nucleotide distributions.

Below is the MI program, as seen I am currently stuck after finding marginal distribution, then dividing the values by the number of bins (8) in my Sort-Seq experiment:

import pandas as pd
import numpy as np
import sys

filename = sys.argv[1]

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

df = pd.read_csv(filename, header=0, sep= ',', skipinitialspace=True)
headers = ['A', 'T', 'G', 'C']
df.columns = headers

df.head(499)

df['max'] = df[['A', 'T', 'G', 'C']].max(axis=1)
df['sum'] = df[['A', 'T', 'G', 'C']].sum(axis=1)

df.loc[:,"A":"C"] = df.loc[:,"A":"C"].div(df["sum"], axis=0)
df['mutation_rate'] = (1-df['max']/df['sum'])
df['max2'] = df[['A', 'T', 'G', 'C']].max(axis=1)
df['sum2'] = df[['A', 'T',  'G', 'C']].sum(axis=1)
df['marginal_distribution']=(1-df['max2']/df['sum2'])

df.head()

#df.head()
#df['A/8'] = df['A'].div(8)
#df['T/8'] = df['T'].div(8)
#df['G/8'] = df['G'].div(8)
#df['C/8'] = df['G '].div(8)
#df.head() --> **8 bins of sequences** 

sys.stdout = open("stats_" + filename, "w")
print(df)
sys.stdout.close()

With the output

    A   T   G   C
0   0.000671    0.000471    0.00028     0.998578
1   0.000591    0.000319    0.000048    0.999042
2   0.999169    0.000351    0.000192    0.000288
3   0.000024    0.000351    0.000032    0.999593
4   0.999297    0.000184    0.000224    0.000296
5   0.00004     0.000184    0.000032    0.999744
6   0.999497    0.000064    0.000144    0.000295
7   0.999329    0.000256    0.000112    0.000303
8   0.000072    0.0002      0.000064    0.999665


 max     sum      mutation_rate
125032  125210  0.001422
125082  125202  0.000958
125107  125211  0.000831
125161  125212  0.000407
125122  125210  0.000703
125180  125212  0.000256
125149  125212  0.000503
125124  125208  0.000671
125170  125212  0.000335

max2     sum2
0.998578    1
0.999042    1
0.999169    1
0.999593    1
0.999297    1
0.999744    1
0.999497    1
0.999329    1
0.999665    1

marginal_distribution
0.001422
0.000958
0.000831
0.000407
0.000703
0.000256
0.000503
0.000671
0.000335

A/numberOfBins  T/numberOfBins  G/numberOfBins  C/numberOfBins
0.000084    0.000059    0.000035    0.124822
0.000074    0.00004     0.000006    0.12488
0.124896    0.000044    0.000024    0.000036
0.000003    0.000044    0.000004    0.124949
0.124912    0.000023    0.000028    0.000037
0.000005    0.000023    0.000004    0.124968
0.124937    0.000008    0.000018    0.000037
0.124916    0.000032    0.000014    0.000038
0.000009    0.000025    0.000008    0.124958

As of right now, each bin is printed to its own output file as seen above. I will need to get specific values from each nucleotide at each specified position across the 8 bins I am using to properly carry out the calculation.

For instance, in the example output file above, 0.000084 is represented at position 1 for A.

When completing this in Excel I use the following formula. Where H294 = 0.000084. The other values in row 294 are the other bins at this exact location.

=H294*LOG(H294/(0.125*SUM(H294,N294,T294,Z294,AF294,AL294,AR294,AX294)),2)

How should I set up the input file? Can this be done in df / called with one line? What line of code do I need here?

python fasta dataframe pandas nucleotide • 815 views

ADD COMMENT • link 4.0 years ago by nameuser ▴ 30