Remove duplicate genes with lower significance in microarray data analysis
5.3 years ago
Gene_MMP8 ▴ 240

I have performed a microarray data analysis using limma in R and now have a list of DEGs. Some gene symbols are repeated, and for each repeated symbol I want to keep only the entry with the highest significance (lowest adjusted p-value). How can I do that in R?

I want to keep only unique genes, choosing the most significant entry whenever a symbol is duplicated. Below is the code that I am trying; tT is the DEG table. This is only half of the code: I am trying to loop through all the symbols and, whenever I find a repeated one, compare it with its duplicates.

for(i in tT$SYMBOL){
  if(length(which(tT$SYMBOL==i))>1){
    index=tT[which(tT$SYMBOL[-c(i),7]==i),]
  }

Really need some help. Thanks

R limma microarray RNA-Seq • 3.7k views

You're working on:

  • limma
  • differential expression
  • microarray

Yet the only tag used is R. Why is that?

In other words, we want a "group by min"; see the related StackOverflow post.

Hi banerjeeshayantan,

I have the same problem. Did you fix it, and if so, how? Thanks

5.3 years ago
AB ▴ 360

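Order your data frame by gene symbol and adjusted p-value, then drop the duplicated symbols so that only the most significant row per gene remains.

# adj.P.Val is the adjusted p-value column in limma's topTable() output; change it if your table uses another name
tT = tT[order(tT$SYMBOL, tT$adj.P.Val), ]
new_tT = tT[!duplicated(tT$SYMBOL), ]  # keeps the first (lowest p-value) row per symbol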
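To see the order-then-deduplicate idea on something concrete, here is a minimal sketch in base R. The table is a made-up stand-in for a limma result: the probe IDs, symbols and p-values are invented, and the column names SYMBOL and adj.P.Val are assumptions that should be swapped for whatever your own tT uses.

```r
# Toy stand-in for a limma DEG table; all values are invented.
tT <- data.frame(
  ID        = c("p1", "p2", "p3", "p4", "p5"),
  SYMBOL    = c("MMP8", "MMP8", "TP53", "ACTB", "ACTB"),
  adj.P.Val = c(0.040, 0.001, 0.020, 0.300, 0.050),
  stringsAsFactors = FALSE
)

# Sort by symbol, then by adjusted p-value (ascending), so the most
# significant probe of each gene comes first within its symbol.
tT_sorted <- tT[order(tT$SYMBOL, tT$adj.P.Val), ]

# duplicated() flags every occurrence after the first, so negating it
# keeps one row per symbol: the one with the lowest adj.P.Val.
new_tT <- tT_sorted[!duplicated(tT_sorted$SYMBOL), ]
print(new_tT)
```

One thing to watch: duplicated() treats repeated NA values as duplicates too, so probes without a symbol would be collapsed into a single row. It may be worth filtering those out (or handling them separately) before deduplicating.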
5.3 years ago
Chirag Parsania ★ 2.0k

See the toy example below.

library(tidyverse)

## toy expression data with duplicated values in column 1
set.seed(32323)
expr_data <- tibble(gene_id = sample(LETTERS[1:5] , 10 , replace = TRUE) , expr =  rnorm(10 ,mean = 10) ) %>% arrange(gene_id)

expr_data
#> # A tibble: 10 x 2
#>    gene_id  expr
#>    <chr>   <dbl>
#>  1 A        9.39
#>  2 B        9.43
#>  3 C        9.52
#>  4 C        9.80
#>  5 C       11.8 
#>  6 D        9.08
#>  7 D        8.76
#>  8 D        9.59
#>  9 E       11.4 
#> 10 E        9.40

## C, D and E are duplicated in column 1. 

## for IDs duplicated in column 1, keep the observation with the highest value in column 2

expr_data %>% 
        group_by(gene_id) %>%  ## group by id column 
        dplyr::arrange(desc(expr)) %>% ## sort rows high to low (arrange() ignores groups unless .by_group = TRUE)
        slice(1) ## get first row from each group
#> # A tibble: 5 x 2
#> # Groups:   gene_id [5]
#>   gene_id  expr
#>   <chr>   <dbl>
#> 1 A        9.39
#> 2 B        9.43
#> 3 C       11.8 
#> 4 D        9.59
#> 5 E       11.4

Created on 2019-08-04 by the reprex package (v0.3.0)
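The toy example keeps the highest value per group, whereas for the question's DEG table the row to keep is the one with the lowest adjusted p-value. A rough sketch of that variant follows, assuming dplyr 1.0.0 or later (for slice_min()) and assuming the table has columns named SYMBOL and adj.P.Val; the data themselves are invented for illustration.

```r
library(dplyr)

# Toy stand-in for a limma DEG table; all values are invented.
tT <- tibble(
  SYMBOL    = c("MMP8", "MMP8", "TP53", "ACTB", "ACTB"),
  logFC     = c(1.2, 2.5, -0.8, 0.3, 0.4),
  adj.P.Val = c(0.040, 0.001, 0.020, 0.050, 0.050)
)

dedup <- tT %>%
  group_by(SYMBOL) %>%
  # keep the single row with the smallest adjusted p-value per gene;
  # with_ties = FALSE returns one row even when p-values are tied
  slice_min(adj.P.Val, n = 1, with_ties = FALSE) %>%
  ungroup()

dedup
```

The two ACTB rows share the same adjusted p-value on purpose: with the default with_ties = TRUE both would be kept, so setting it to FALSE is what guarantees one row per symbol.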

I think a better logic would be group by followed by max instead of sort + slice(1). What do you think?

Yes, true. More readable and less code. Thanks :)

======= Edit

However, with max, if there is a tie, all matching rows will be returned. See the example:

iris  %>% as_tibble() %>% group_by(Species) %>% filter(Petal.Width == max(Petal.Width))
# A tibble: 5 x 5
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5           3.5          1.6         0.6 setosa    
2          5.9         3.2          4.8         1.8 versicolor
3          6.3         3.3          6           2.5 virginica 
4          7.2         3.6          6.1         2.5 virginica 
5          6.7         3.3          5.7         2.5 virginica
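If exactly one row per group is wanted while still thinking in terms of "max", one possible workaround is to pick the row by position instead of by value. The sketch below is just one way to do it, using the same iris data: which.max() returns the index of the first maximum only, so ties are broken by row order.

```r
library(dplyr)

top_per_species <- iris %>%
  as_tibble() %>%
  group_by(Species) %>%
  # which.max() gives the position of the first maximum in each group,
  # so every group contributes exactly one row even when values tie
  slice(which.max(Petal.Width)) %>%
  ungroup()

nrow(top_per_species)
```

This gives three rows, one per species, instead of the five returned by the filter() version. Which of the tied virginica rows survives depends only on their order in the data, so sort first if a particular tie-break is needed.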

Fair enough. Thanks for alerting to the use case for slice(1) :-)
