Question

Mutual Information from Nucleotide Distribution in Python

3

4.1 years ago

nameuser ▴ 30

This question has been removed from this site. Please see Stack Overflow if interested.

Previous content restored by Ram from Google Cache

Hi there,

I'm trying to write a program that calculates the mutation rate from text files of nucleotide distributions. I'd like to port a mutual information calculation that I currently do in Excel to Python, and I'm stuck at this step of the calculation.

An example input file:

A,T,G,C
84 , 59 , 35 , 125032 
74 , 40 , 6 , 125082 
125107 , 44 , 24 , 36 
3 , 44 , 4 , 125161 
125122 , 23 , 28 , 37 
5 , 23 , 4 , 125180 
125149 , 8 , 18 , 37 
125124 , 32 , 14 , 38 
9 , 25 , 8 , 125170

The program:

import pandas as pd
import sys

filename = sys.argv[1]
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None = no limit (-1 is deprecated)
col = ['A', 'T', 'G', 'C']
df = pd.read_csv(filename, skipinitialspace=True, usecols=col)
df.head(287)
df['max'] = df[['A', 'T', 'G', 'C']].max(axis=1)
df['sum'] = df[['A', 'T', 'G', 'C']].sum(axis=1)
df.loc[:,"A":"C"] = df.loc[:,"A":"C"].div(df["sum"], axis=0)
df['mutation_rate'] = (1-df['max']/df['sum'])
df['max2'] = df[['A', 'T', 'G', 'C']].max(axis=1)
df['sum2'] = df[['A', 'T', 'G', 'C']].sum(axis=1)
df['marginal_distribution']=(1-df['max2']/df['sum2'])
df.head()

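numberOfBins = int(sys.argv[2])  # command-line arguments are strings; was hard-coded as 8 below
df['A/numberOfBins'] = df['A'].div(numberOfBins)
df['T/numberOfBins'] = df['T'].div(numberOfBins)
df['G/numberOfBins'] = df['G'].div(numberOfBins)
df['C/numberOfBins'] = df['C'].div(numberOfBins)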
df.head()

With the output

   A         T         G         C
0  0.000671  0.000471  0.00028   0.998578
1  0.000591  0.000319  0.000048  0.999042
2  0.999169  0.000351  0.000192  0.000288
3  0.000024  0.000351  0.000032  0.999593
4  0.999297  0.000184  0.000224  0.000296
5  0.00004   0.000184  0.000032  0.999744
6  0.999497  0.000064  0.000144  0.000295
7  0.999329  0.000256  0.000112  0.000303
8  0.000072  0.0002    0.000064  0.999665



max     sum     mutation_rate
125032  125210  0.001422
125082  125202  0.000958
125107  125211  0.000831
125161  125212  0.000407
125122  125210  0.000703
125180  125212  0.000256
125149  125212  0.000503
125124  125208  0.000671
125170  125212  0.000335

max2    sum2
0.998578    1
0.999042    1
0.999169    1
0.999593    1
0.999297    1
0.999744    1
0.999497    1
0.999329    1
0.999665    1

marginal_distribution
0.001422
0.000958
0.000831
0.000407
0.000703
0.000256
0.000503
0.000671
0.000335

A/numberOfBins  T/numberOfBins  G/numberOfBins  C/numberOfBins
0.000084        0.000059        0.000035        0.124822
0.000074        0.00004         0.000006        0.12488
0.124896        0.000044        0.000024        0.000036
0.000003        0.000044        0.000004        0.124949
0.124912        0.000023        0.000028        0.000037
0.000005        0.000023        0.000004        0.124968
0.124937        0.000008        0.000018        0.000037
0.124916        0.000032        0.000014        0.000038
0.000009        0.000025        0.000008        0.124958

Ultimately, I am trying to compute the Shannon entropy / mutual information from these distributions. Thank you so much.
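For reference, here is a minimal sketch of the per-position Shannon entropy, assuming the usual definition H = -sum(p * log2(p)) over the four nucleotide frequencies at a position. The names `shannon_entropy`, `rows` and `entropies` are made up for illustration and are not part of the program above. Mutual information between two positions would additionally need joint counts of nucleotide pairs, which a per-position table like the one above does not contain, so only the entropy is sketched.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of one position, from raw nucleotide counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts:
        if n > 0:  # a zero count contributes nothing (0 * log 0 is taken as 0)
            p = n / total
            entropy -= p * math.log2(p)
    return entropy

# First three rows of the example input, as (A, T, G, C) counts
rows = [
    (84, 59, 35, 125032),
    (74, 40, 6, 125082),
    (125107, 44, 24, 36),
]

entropies = [shannon_entropy(r) for r in rows]
for r, h in zip(rows, entropies):
    print(r, round(h, 6))
```

A position where all four nucleotides are equally frequent gives the maximum of 2 bits, and a position with a single nucleotide gives 0, so strongly conserved positions like the ones above should come out close to 0.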

entropy • 945 views

updated 4.0 years ago by Ram 44k • written 4.1 years ago by nameuser ▴ 30

1

In your loop:

row = list(map(int, row)) 
print(1 - max(row) / sum(row))
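Since the loop this refers to is no longer shown in the question, here is a self-contained sketch of how the two lines could fit together. It assumes the loop iterated over rows from `csv.reader`. The `sample` buffer stands in for the real input file and the name `mutation_rates` is made up for illustration.

```python
import csv
import io

# Stand-in for the input file: the first rows of the example in the question
sample = io.StringIO(
    "A,T,G,C\n"
    "84 , 59 , 35 , 125032 \n"
    "74 , 40 , 6 , 125082 \n"
    "125107 , 44 , 24 , 36 \n"
)

reader = csv.reader(sample)
next(reader)  # skip the A,T,G,C header line

mutation_rates = []
for row in reader:
    row = list(map(int, row))  # int() tolerates the spaces around each value
    mutation_rates.append(1 - max(row) / sum(row))

for rate in mutation_rates:
    print(round(rate, 6))
```

For these three rows the printed values should match the first three entries of the mutation_rate column in the question's output.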

Edit: the text (especially the code) of the question appears to have changed since it was first posted, so this comment may no longer make sense.

4.0 years ago by cschu181 ★ 2.8k

0

Hello nameuser,

Do not redact content after you've received feedback on a post. This is inconsiderate and such behavior can lead to suspension of your user account.

Please point to the StackOverflow post that you are referring to. In the meantime, I'll be restoring the content of this post from Google Cache.

4.0 years ago by Ram 44k