Question

Filter data frame with dplyr

4.3 years ago

luca ▴ 70

Hi there, I would like to filter my data frame, which has 5 columns: column 1 contains gene names, column 2 contains fold changes (expressed as logFC), column 3 contains the FDR-adjusted p-value, and the other two columns contain other things.

The thing is that genes can be duplicated in the data frame, so I would like to remove the duplicates. To do that, for each gene I keep the row (among the duplicates) that has the lowest FDR, like this:

convertedata2 = convertedata %>% group_by(Geneid) %>% filter(FDR == min(FDR))

The problem is that duplicates of a gene can share the same minimum FDR (e.g. if all of them have FDR=1), so they are not filtered out. To remove them, I would like to break the tie on logFC and keep the row that has the highest absolute logFC. So I changed the previous command to this:

convertedata2 = convertedata %>% group_by(Geneid) %>% filter(FDR == min(FDR)) %>% filter(logFC == max(abs(logFC)))

but it doesn't work. I suspect it has to do with the abs function, but I am not sure why or what is going on. Any help is much appreciated!

Thanks, Luca

dplyr R filter • 2.0k views

updated 4.3 years ago by rpolicastro 13k • written 4.3 years ago by luca ▴ 70

score 2 · Answer 1 · 2020-08-04

Here is some example data (logFC is drawn with rnorm, so your values will differ from the ones printed below).

df <- data.frame(Geneid=c("A","A","B","C"), FDR=c(0.01,0.01,0.25,0.025), logFC=rnorm(4,0,3))

> df
  Geneid   FDR     logFC
1      A 0.010  1.970233
2      A 0.010 -2.703701
3      B 0.250  3.957811
4      C 0.025 -2.641965
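
To see why the filter from the question drops genes, it may help to run it on the values printed above. The following is a small sketch with those values hard-coded; the name `original_attempt` is only for illustration. It compares the signed logFC against the largest absolute value, so any gene whose strongest change is negative has no matching row.

```r
library(dplyr)

# Values copied from the printed example above (hard-coded so the result is reproducible).
df <- data.frame(
  Geneid = c("A", "A", "B", "C"),
  FDR    = c(0.01, 0.01, 0.25, 0.025),
  logFC  = c(1.970233, -2.703701, 3.957811, -2.641965)
)

# The filter from the question: signed logFC is compared against max(abs(logFC)).
original_attempt <- df %>%
  group_by(Geneid) %>%
  filter(FDR == min(FDR)) %>%
  filter(logFC == max(abs(logFC))) %>%
  ungroup()

# For genes A and C the strongest logFC is negative, so no row equals the
# (always non-negative) maximum of abs(logFC) and the whole gene disappears.
original_attempt$Geneid
```

With these values only gene B survives, because it is the only gene whose largest absolute logFC is positive.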

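Here is how you would do the filtering (you were really close). The problem was that the signed logFC was compared with max(abs(logFC)), so a gene whose largest change is negative never matches; abs() has to be applied on both sides of the comparison.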

library("dplyr")

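df <- df %>%
  group_by(Geneid) %>%
  # keep the row(s) with the lowest FDR for each gene
  filter(FDR == min(FDR)) %>%
  # among those rows only, keep the largest absolute logFC
  filter(abs(logFC) == max(abs(logFC))) %>%
  ungroup()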

> df
# A tibble: 3 x 3
  Geneid   FDR logFC
  <chr>  <dbl> <dbl>
1 A      0.01  -2.70
2 B      0.25   3.96
3 C      0.025 -2.64
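
Filtering on equality can still leave more than one row per gene when rows tie on both FDR and absolute logFC. If exactly one row per gene is required, one option is to sort and take the first row of each group. The sketch below is an alternative under that assumption, not the method used above; the data and the names `genes` and `deduplicated` are hypothetical. Gene A checks that logFC is used only as a tie-breaker (its lowest-FDR row does not have the largest absolute logFC), and gene D has two fully tied rows.

```r
library(dplyr)

# Hypothetical data: A has distinct FDRs, D has two rows tied on FDR and logFC.
genes <- data.frame(
  Geneid = c("A", "A", "B", "D", "D"),
  FDR    = c(0.01, 0.50, 0.25, 1, 1),
  logFC  = c(1.0, -3.0, 2.0, 1.5, 1.5)
)

deduplicated <- genes %>%
  # Lowest FDR first; ties on FDR are ordered by largest absolute logFC.
  arrange(Geneid, FDR, desc(abs(logFC))) %>%
  group_by(Geneid) %>%
  # Keep only the top-ranked row of each gene, even if several rows tie completely.
  slice(1) %>%
  ungroup()

deduplicated
```

Here gene A keeps its FDR = 0.01 row even though the other row has the larger absolute logFC, and gene D is reduced to a single row. Which of several fully tied rows is kept is arbitrary, so this only fits cases where that choice does not matter.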