Question

Excel: using quartiles & averageifs to combine probe values without outliers

0

Entering edit mode

4.8 years ago

RNAseqer ▴ 280

Hello all, I am looking for an excel wizard to help me with a messy task: I would like to take the average of values in one column for all those cells sharing an ID in another column BUT with the added stipulation that outliers are excluded based on the use of quartiles.

Here is an example of the spreadsheet:

A       B           C      D     
ABCA5   213353_at   5      5
ABCB1   209993_at   4      4
ABHD5   213805_at   5      38.33333333
ABHD5   213935_at   10     38.33333333
ABHD5   218739_at   100    38.33333333
ACSM3   205942_s_at 1      1000.5
ACSM3   210377_at   2000   1000.5

As you can see Column A has a geneID, column B a probe name (which is irrelevant to this task) and C has the 'expression value, which I have given false values here to make the outliers problem a bit more visually extreme. In column D I have filled in the average values using the following excel code:

=AVERAGEIFS(C:C,A:A,A2 )

I placed that in the first cell of column D then dragged it down to fill in all the appropriate values. What I would like to do now is amend this excel function so that it performs this task but takes advantage of the QUARTILE function to only use values in C to calculate an average if:

The ID in column A for that value matches
It falls within an upper and lower bound defined by 1.5 times the interquartile range.

I'm just having trouble managing all the nested if() functions and conditionals to do this and thought maybe someone with greater excel experience could offer a solution. Many thanks for any help you can provide!

excel outliers quartile averageifs matching • 1.2k views

ADD COMMENT • link updated 4.8 years ago by swbarnes2 14k • written 4.8 years ago by RNAseqer ▴ 280

1

Entering edit mode

4.8 years ago

swbarnes2 14k

Hello all, I am looking for an excel wizard to help me with a messy task:

You know what Excel does to gene names, right? Give it a try with Mar3 and Sep6.

The response I'd like to give is not quite safe for work, so I'll just advise you to google "bioinformatics clippy excel", and take note of the first two entries under Images.

ADD COMMENT • link 4.8 years ago by swbarnes2 14k

score 2 · Accepted Answer · 2020-02-19

Yes... I HAVE had run ins with that scoundrel Clippy before, and am well aware of the problems with excel. Taking your gentle hint I have gone and tried to solve the problem in R. Here is what I've come up with:

Adapting some R code from the links at the bottom I have done the following:

> toy
      A           B    C    D
1 ABCA5   213353_at    5    6
2 ABCB1   209993_at    4    4
3 ABHD5   213805_at    5    6
4 ABHD5   213935_at   10   11
5 ABHD5   213666_at   12   12
6 ABHD5   218739_at 1000   75
7 ACSM3 205942_s_at    1 1000
8 ACSM3   210377_at 2000 1500

Is my data

myFunction3 = function(x, na.rm = TRUE, ...) {
    qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
    H <- 1.5 * IQR(x, na.rm = na.rm)
    y <- x
    y[x < (qnt[1] - H)] <- NA
    y[x > (qnt[2] + H)] <- NA
    mean(y, na.rm = TRUE)
}

Is my function, called by:

newtable <- aggregate(toy[, -c(1,2)],
                      by = list(A = toy$A),          #Here A could/would be the column header for the probe labels
                      FUN = myFunction3,
                      na.rm = TRUE)
newtable

Which results in:

> newtable   
      A      C           D
1 ABCA5    5.0    6.000000
2 ABCB1    4.0    4.000000
3 ABHD5    9.0    9.666667
4 ACSM3 1000.5 1250.000000

Do you see any errors or anything I have missed? If so please let me know. One limitation I'd I might need to overcome is that I'm going to have a fair number of genes with only 3 probes I think, so the outlier criterion I'm using here might need to be changed in the cases where there are say expression values like: 10, 11, 320. Also, in the case of column C for ACSM# there are only two values 1 and 2000 which are obviously wildly different and maybe need to be excluded? But this approach does not detect that potential issue.

From stack overflow: https://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset

From Biostars (Is this appropriate linking within the site?)

using R aggregate command/combining affy probes into gene IDs