Question

Subset top 5% values of data frames stored in list

6.2 years ago

paolo002 ▴ 160

Hi, this is probably a question for Stack Overflow, but I am posting it here because there I am not getting many replies; apologies for this.

I have 24 data frames with different numbers of rows and various columns (for instance, a column of SNP IDs and other columns with the corresponding values for each SNP), and I stored those data frames inside a list. I am performing various operations on all the data frames at the same time. For instance, to sort every data frame in the list by one column in decreasing order, I do:

myfiles_ordered <- lapply(myfiles, function(x) { x[ order(x$column_name_to_order, decreasing = TRUE), ] })

Now, after ordering that column, I would like to take the top 5% of its values. I thought I could subset each data frame to its first rows, using its number of rows multiplied by 0.05, and I wrote something like this:

myfiles_top5<-lapply(myfiles_ordered, function(x) {x[1:nrow(x)*5/100,]})

However, it does not seem to work. Any help is highly appreciated, thanks.

R subset • 1.8k views

ADD COMMENT • link updated 6.2 years ago by zx8754 12k • written 6.2 years ago by paolo002 ▴ 160

there I am not getting much replies

Could you add the link to StackOverflow post?

ADD REPLY • link 6.2 years ago by zx8754 12k

https://stackoverflow.com/questions/53724885/subset-multiple-data-frames-stored-inside-a-list

In any case, this was the link, but maybe I did not explain my problem so well there.

ADD REPLY • link 6.2 years ago by paolo002 ▴ 160

score 3 · Accepted Answer · 2018-12-13

Try:

myfiles_top5 <- lapply(myfiles_ordered, function(x) { x[ 1:round(nrow(x)*5/100), ]})

The original code fails because `:` binds more tightly than `*` and `/`, so 1:nrow(x)*5/100 first creates the sequence 1 to n and then multiplies every element by 5/100:

1:nrow(mtcars) * 5/100
#  [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80
# [17] 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.55 1.60

Not what we need...
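
To make the failure concrete, here is a minimal sketch (using the built-in mtcars data, 32 rows) of what R does with those fractional indices. Fractional subscripts are truncated toward zero, and an index of 0 selects nothing, so the result is just the first row repeated:

```r
# Sketch with the built-in mtcars data (32 rows).
idx <- 1:nrow(mtcars) * 5/100    # parsed as ((1:32) * 5) / 100

# Fractional subscripts are truncated toward zero when used for indexing
int_idx <- as.integer(idx)
table(int_idx)                   # 19 zeros (0.05 to 0.95) and 13 ones (1.00 to 1.60)

# Zeros select nothing, so only row 1 comes back, once per index equal to 1
wrong <- mtcars[idx, ]
nrow(wrong)                      # 13
unique(wrong$mpg)                # 21, the mpg of the first row (Mazda RX4)
```

So the call does not raise an error; it silently returns the wrong rows, which is why it "does not seem to work".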

Instead we need to compute the 5% first and then build the sequence, using parentheses ():

1:(nrow(mtcars) * 5/100)
# [1] 1

Again not ideal, as both of the following give 1:

1:1.2
# [1] 1
1:1.6
# [1] 1

Whereas we might need 1:2 for 1:1.6, so we use round:

1:round(1.2)
# [1] 1
1:round(1.6)
# [1] 1 2
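
One caveat, offered as a hedged aside rather than as part of the original answer: for a small data frame, round(nrow(x) * 5/100) can be 0, and 1:0 counts down to give 1 0, so one row is still returned. A minimal sketch, assuming a toy 5-row data frame (the names small and n_top are made up for illustration), shows how seq_len() or head() avoid this:

```r
small <- head(mtcars, 5)                 # toy data frame with 5 rows
n_top <- round(nrow(small) * 5/100)      # round(0.25) is 0

1:n_top                                  # 1 0, because 1:0 counts down
nrow(small[1:n_top, ])                   # 1 row, not 0

seq_len(n_top)                           # integer(0)
nrow(small[seq_len(n_top), ])            # 0 rows
nrow(head(small, n_top))                 # 0 rows
```

If at least one row should always be kept, using ceiling() instead of round() is one possible choice.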

Update: We can do ordering and subsetting 5% in one go, e.g.:

# Using base
head(mtcars[ order(mtcars$mpg, decreasing = TRUE), ], round(nrow(mtcars) * 5/100))
#                 mpg cyl disp hp drat    wt  qsec vs am gear carb
# Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
# Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1

# Using dplyr (top_n() still works but is superseded by slice_max() since dplyr 1.0.0)
library(dplyr)
top_n(mtcars, round(nrow(mtcars) * 5/100), wt = mpg)
#    mpg cyl disp hp drat    wt  qsec vs am gear carb
# 1 32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
# 2 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
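
Applied to the original question, the base R version can be wrapped in lapply() so that every data frame in the list is ordered and cut in one pass. This is only a sketch under assumptions: the list myfiles is simulated here from the built-in mtcars data, and the helper top_fraction() and its arguments are hypothetical names, not something from the original post.

```r
# Hypothetical helper: order by `column` (decreasing) and keep the top fraction.
top_fraction <- function(df, column, fraction = 0.05) {
  ordered <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ordered, round(nrow(ordered) * fraction))
}

# Stand-in for the real list of 24 data frames (assumption for illustration)
myfiles <- list(
  a = mtcars,                       # 32 rows
  b = head(mtcars, 20),             # 20 rows
  c = mtcars[mtcars$cyl == 4, ]     # 11 rows
)

myfiles_top5 <- lapply(myfiles, top_fraction, column = "mpg")
sapply(myfiles_top5, nrow)
# a b c
# 2 1 1
```

Using head() rather than 1:n keeps the helper safe when the rounded row count is 0.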