Hello all, I am looking for an excel wizard to help me with a messy task: I would like to take the average of values in one column for all those cells sharing an ID in another column BUT with the added stipulation that outliers are excluded based on the use of quartiles.
Here is an example of the spreadsheet:
A B C D
ABCA5 213353_at 5 5
ABCB1 209993_at 4 4
ABHD5 213805_at 5 38.33333333
ABHD5 213935_at 10 38.33333333
ABHD5 218739_at 100 38.33333333
ACSM3 205942_s_at 1 1000.5
ACSM3 210377_at 2000 1000.5
As you can see Column A has a geneID, column B a probe name (which is irrelevant to this task) and C has the 'expression value, which I have given false values here to make the outliers problem a bit more visually extreme. In column D I have filled in the average values using the following excel code:
=AVERAGEIFS(C:C,A:A,A2 )
I placed that in the first cell of column D then dragged it down to fill in all the appropriate values. What I would like to do now is amend this excel function so that it performs this task but takes advantage of the QUARTILE function to only use values in C to calculate an average if:
- The ID in column A for that value matches
- It falls within an upper and lower bound defined by 1.5 times the interquartile range.
I'm just having trouble managing all the nested if() functions and conditionals to do this and thought maybe someone with greater excel experience could offer a solution. Many thanks for any help you can provide!
For the sake of completeness, in case anyone else digs up this post and finds it useful, here are some additional methods I wound up trying besides the use of quartiles shown above:
Taking the Standard deviation and dropping values more than 2* the Standard deviation from the mean (this method may be problematic since the outlier itself affects the mean):
The MAD method (Median Absolute Deviation):
Just taking the median of the values:
All of course implemented with the same block of code except for the changing of the function called by FUN=