dplyr and the %>% operator
I highly recommend looking into the 'dplyr' package as well as the other tools from the 'tidyverse'. dplyr is an extremely powerful tool for cleaning and summarizing data. It makes the '%>%' operator available, which pipes the output of one function into the next. Here is an illustration of the operator:
# x - a vector or other object with data. F, G, H - functions that you would like to apply to x.
# Here is the base approach, i.e. nested application:
y <- F(G(H(x)))
# Here is the piping approach:
x %>%
H() %>%
G() %>%
F() -> y
Both approaches give exactly the same result, but piping brings the code closer to the way people think: "I take x, pass it to the H function, then the result goes to the G function, afterwards we apply the F function, and finally we store the value in the y variable".
The %>% operator is defined in the 'magrittr' package (dplyr re-exports it), and it is especially handy for the data frame operations defined in 'dplyr'.
Going back to your question, here is the solution for columns:
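To make the pattern concrete, here is a minimal sketch with real functions standing in for H, G and F. The choice of sqrt(), sum() and round() is only for illustration, and it assumes the 'magrittr' package is installed:

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

x <- c(1, 4, 9, 16)

# Nested form: read from the inside out
y_nested <- round(sum(sqrt(x)), 1)

# Piped form: read from top to bottom
x %>%
  sqrt() %>%
  sum() %>%
  round(1) -> y_piped

# Both forms compute the same value
identical(y_nested, y_piped)
```

Extra arguments, such as the 1 passed to round(), go after the piped value, which is inserted as the first argument.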
library(dplyr)
# Get a logical index that is TRUE for the columns containing only NAs
onlyNAcolumns_idx <- data %>%
  is.na() %>%
  apply(MARGIN = 2, FUN = all)
# Get the table without the columns containing only NAs
# (drop = FALSE keeps the result two-dimensional even if a single column remains):
data[, !onlyNAcolumns_idx, drop = FALSE]
How to read the code that calculates onlyNAcolumns_idx:
- We take the 'data' object.
- Then we apply the is.na() function. The result is a logical matrix with the same dimensions as 'data'; it contains TRUE wherever the original 'data' has an NA and FALSE elsewhere.
- We apply the function all() to every column. It returns TRUE only if every element of the vector it is given is TRUE.
That's it! The length of 'onlyNAcolumns_idx' is the same as the number of columns in 'data':
# You get TRUE if you execute this:
length(onlyNAcolumns_idx) == ncol(data)
Finally, you just subset the data frame with the logical index.
Once you get used to dplyr, life with R becomes easier. It just takes some time to adapt.
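The steps above can be tried on a small data frame. This is only a sketch: the toy 'data' object and the intermediate name 'na_mask' are made up for illustration, and base R is enough to run it:

```r
# Hypothetical toy data frame; column 'b' consists only of NAs
data <- data.frame(a = c(1, NA, 3),
                   b = c(NA, NA, NA),
                   c = c("x", "y", NA))

# Step 1: logical matrix with the same dimensions as 'data'
na_mask <- is.na(data)

# Step 2: one TRUE/FALSE per column, TRUE if the whole column is NA
onlyNAcolumns_idx <- apply(na_mask, MARGIN = 2, FUN = all)
onlyNAcolumns_idx

# Step 3: logical subsetting; drop = FALSE keeps a data frame
# even if only one column survives
cleaned <- data[, !onlyNAcolumns_idx, drop = FALSE]
names(cleaned)
```

Columns 'a' and 'c' contain some NAs but are kept, because only columns that are entirely NA are flagged.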
Example
x <- matrix(seq(25), ncol = 5)
x[2,] <- NA
x[,4] <- NA
x[4,2] <- NA
x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   NA   21
[2,]   NA   NA   NA   NA   NA
[3,]    3    8   13   NA   23
[4,]    4   NA   14   NA   24
[5,]    5   10   15   NA   25
# The 2nd row and the 4th column contain only NAs.
# We're going to drop them and keep every other row and column.
# Treat columns:
onlyNAcolumns_idx <- x %>%
is.na() %>%
apply(MARGIN = 2, FUN = all)
onlyNAcolumns_idx
[1] FALSE FALSE FALSE TRUE FALSE
( y <- x[,!onlyNAcolumns_idx] ) # NA column disappeared
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   21
[2,]   NA   NA   NA   NA
[3,]    3    8   13   23
[4,]    4   NA   14   24
[5,]    5   10   15   25
# Now it's time to process the rows:
onlyNArows_idx <- y %>%
is.na() %>%
apply(MARGIN = 1, FUN = all)
onlyNArows_idx
[1] FALSE TRUE FALSE FALSE FALSE
y[!onlyNArows_idx,] # NA row disappeared
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   21
[2,]    3    8   13   23
[3,]    4   NA   14   24
[4,]    5   10   15   25
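If you need this often, both steps can be wrapped into one function. The sketch below is an alternative formulation, not the code used above: the name 'drop_all_na' is hypothetical, and it uses base R's rowSums() and colSums() instead of apply():

```r
# Hypothetical helper: drops rows and columns that consist only of NAs.
# Accepts a matrix or a data frame.
drop_all_na <- function(m) {
  na_mask <- is.na(m)
  keep_cols <- colSums(!na_mask) > 0  # column has at least one non-NA value
  keep_rows <- rowSums(!na_mask) > 0  # row has at least one non-NA value
  # drop = FALSE keeps the result two-dimensional
  m[keep_rows, keep_cols, drop = FALSE]
}

# Same example matrix as above
x <- matrix(seq(25), ncol = 5)
x[2, ] <- NA
x[, 4] <- NA
x[4, 2] <- NA

res <- drop_all_na(x)
res
```

Both masks are computed from the original object, so the order in which rows and columns are removed does not matter. The single NA at the original position [4,2] stays in the result, since its row and column also hold real values.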
I suggest adding the commands you've tried. For columns, see link, and for rows, link
In fact it was not so hard to find ;-)
Hi Maciej,
Thank you for your prompt reply. Indeed, not difficult to find but I've tried all the suggestions I could find from the forums. The commands you just sent me, na.omit, na.rm = TRUE, x[complete.cases(x), ] and many more which I didn't save because they didn't provide me with the desired result.
Thanks again!