Question

Calculating Averages For Windows

1

Entering edit mode

12.7 years ago

Rubal7 ▴ 40

Hello all,

I have columns of population genetic data, here's an example:

line    chromosome    fst_score
1    1    0.3
2    1    0.7
3    1    0.3
4    1    0.15
5    1    0.4
6    2    0.6
7    2    0.94
8    2    0.17
9    2    0.19

I want to calculate the average of the values in column 3 (the Fst score) but not for the whole column, I need to bin windows of certain sizes. To start with I'd just like to know how to calculate the average for 10 rows of data column 3. I know that to calculate the average for the whole column I can do something like this:

for line in fileObj:
    lineList = line.strip().split()
    if lineList[1] in ['1100', '1200', '1300']:
        list_to_average = [float(s) for s in lineList[2:]]
        average = sum(list_to_average)/len(list_to_average)

but I am not sure how to initiate a reliable counter that would do this for every 10 rows and output this along with the value in the first column (so that I know which lines the average comes from).

This is for chromosome data so in the real file the values in column 2 are chromosome positions and I will use these to define the number of rows that need to be average together as 1mb. But this is trickier as I will need to count when the distance between rows has reached 1mb as I iterate through the file. For now I would just like to solve the first challenge of calculating an average for every 10 lines in a file.

Please let me know if I can format the question better or if there is already an answer on this forum (I have looked)

Any help is appreciated. I'm still getting to grips with programming!

Thanks

Rubal7

snp fst python windows • 3.6k views

ADD COMMENT • link 12.7 years ago by Rubal7 ▴ 40

0

Entering edit mode

Are you committed to a Python solution? This would be very easy in R.

ADD REPLY • link 12.7 years ago by Neilfws 49k

0

Entering edit mode

I'm open to R solutions too, especially if it is simpler. But I am even less familiar with R syntax

ADD REPLY • link 12.7 years ago by Rubal7 ▴ 40

Ram · Answer 1 · 2012-03-03

3

Entering edit mode

12.7 years ago

Istvan Albert 101k

In these types of problems it is often advantageous to transform your data into a numpy data structure after which you can use many of the numpy functions to compute the quantities of interest. This has multiple benefits as far as simplicity and speed goes.

For a moving window average you could then use the numpy.convolve function as explained here:

http://argandgahandapandpa.wordpress.com/2011/02/24/python-numpy-moving-average-for-data/

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 12.7 years ago by Istvan Albert 101k

0

Entering edit mode

thanks, i'll investigate numpy

ADD REPLY • link 12.7 years ago by Rubal7 ▴ 40

score 3 · Answer 2 · 2012-03-03

3

Entering edit mode

12.7 years ago

Adrian ▴ 700

You could also just do this directly with a for loop.

counter = 0
sum = 0

for line in fileObj:
    linedata = line.strip().split()
    counter += 1
    id = linedata[1]
    sum += float(linedata[2])

    if counter == 10:
        average = sum / counter
        print "Average of last 10 values for %s: %f" % (id, average)
        # reset counter   
        sum = 0
        counter = 0

ADD COMMENT • link 12.7 years ago by Adrian ▴ 700

0

Entering edit mode

thanks that's great

ADD REPLY • link 12.7 years ago by Rubal7 ▴ 40

0

Entering edit mode

I understand the logic to this answer but when i run the script it runs but produces no output. Any ideas why? Here is how I formatted it (I checked and the file exists):

inputfile = open("/Users/..filename", "r")

counter = 0 sum = 0

for line in inputfile: linedata = line.strip().split() counter += 1 id = linedata[1] sum += float(linedata[2]) print inputfile

if counter == 10:
    average = sum / counter
    print "Average of last 10 value

ADD REPLY • link 12.7 years ago by Rubal7 ▴ 40

score 0 · Answer 3 · 2012-03-04

0

Entering edit mode

12.7 years ago

Rubal7 ▴ 40

Here is my attempted solution based on Adrian's answer. Unfortunately even though I understand the logic and think it should be working, the script runs and finishes without printing anything. I checked the inputfile exists and has more than 10 rows of data. Can anyone identify the problem in the script? Otherwise I guess it must be a problem with the formatting of the inputfile.

Thanks in advance!

inputfile = open("inputfilepath", "r")

counter = 0
sum = 0

for line in inputfile:
    linedata = line.strip().split()
    counter += 1
    id = linedata[1]
    sum += float(linedata[2])
    print inputfile

    if counter == 10:
        average = sum / counter
        print "Average of last 10 values for %s: %f" % (id, average)
        # reset counter   
        sum = 0
        counter = 0

ADD COMMENT • link 12.7 years ago by Rubal7 ▴ 40

0

Entering edit mode

I did a fast check and the script works fine for me. The only thing that should be implemented is to skip the first line of the file if it contains a header. This can be done in different ways. If you not do this, you will get an error from float(linedata[2]).