dplyr and %>% operator

Question

Remove columns and rows which have only NAs without deleting all the data

0

Entering edit mode

7.8 years ago

ElenaL • 0

Hi everyone,

I am not entirely new in R but somehow the following problem bugs me for days now.

I have a pairwise correlation matrix of SNPs and some columns and rows returned only NAs. When trying to omit or in any way delete these rows or columns, all the data is deleted.

Do you have any ideas of how I can do this without deleting all the data and just getting rid of the NA columns and rows?

Any help would be greatly appreciated.

Thank you in advance, Elena

missing data R correlation SNPs • 42k views

ADD COMMENT • link updated 6.4 years ago by slegotsky • 0 • written 7.8 years ago by ElenaL • 0

2

Entering edit mode

I suggest adding the commands you've tried. For columns confront link and for rows link

In fact it was not so hard to find ;-)

ADD REPLY • link 7.8 years ago by boczniak767 ▴ 870

0

Entering edit mode

Hi Maciej,

Thank you for your prompt reply. Indeed, not difficult to find but I've tried all the suggestions I could find from the forums. The commands you just sent me, na.omit, na.rm = TRUE, x[complete.cases(x), ] and many more which I didn't save because they didn't provide me with the desired result.

Thanks again!

ADD REPLY • link 7.8 years ago by ElenaL • 0

1

Entering edit mode

7.8 years ago

Noushin N ▴ 620

Assuming that you want to remove any rows or columns containing NA values, you can try the following approach.

x = matrix(seq(49), ncol = 7)
x[2,5]= NA
x
       [,1] [,2] [,3] [,4] [,5] [,6] [,7]
 [1,]    1    8   15   22   29   36   43
 [2,]    2    9   16   23   NA   37   44
 [3,]    3   10   17   24   31   38   45
 [4,]    4   11   18   25   32   39   46
 [5,]    5   12   19   26   33   40   47
 [6,]    6   13   20   27   34   41   48
 [7,]    7   14   21   28   35   42   49

keep.cols = which(apply(!is.na(x), 2, all))
keep.rows = which(apply(!is.na(x), 1, all))
clean.x = x[keep.rows,keep.cols]
clean.x
        [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    1    8   15   22   36   43
 [2,]    3   10   17   24   38   45
 [3,]    4   11   18   25   39   46
 [4,]    5   12   19   26   40   47
 [5,]    6   13   20   27   41   48
 [6,]    7   14   21   28   42   49

ADD COMMENT • link 7.8 years ago by Noushin N ▴ 620

0

Entering edit mode

No need for apply, which, use rowSums and colSums instead.

ADD REPLY • link 6.4 years ago by zx8754 12k

0

Entering edit mode

7.8 years ago

boczniak767 ▴ 870

So, for a matrix x containing some rows and columns with all NAs try: x_no_na_col=x[,apply(!is.na(x),2,any)] for columns
and ind <- apply(x_no_na_col, 1, function(x_no_na_col) all( is.na(x_no_na_col)))
then x_no_na=x_no_na_col[!ind,] for rows

ADD COMMENT • link 7.8 years ago by boczniak767 ▴ 870

0

Entering edit mode

No need for apply, which, use rowSums and colSums instead.

ADD REPLY • link 6.4 years ago by zx8754 12k

0

Entering edit mode

7.8 years ago

theobroma22 ★ 1.2k

Why remove them when instead you should just change the NA value to a zero.

X[is.na(X)] <- 0

ADD COMMENT • link 7.8 years ago by theobroma22 ★ 1.2k

1

Entering edit mode

Changing an NA to 0 is most of the times risky, since it might be interpreted as a real 0- which according to the data, might have another meaning than the NA-not measured -not available. This could change the further analysis outcome and interpretation !

ADD REPLY • link 7.0 years ago by mariakondili13 ▴ 30

0

Entering edit mode

This is not required by OP.

ADD REPLY • link 6.4 years ago by zx8754 12k

0

Entering edit mode

6.4 years ago

slegotsky • 0

Here is a bit clearer solution:

library(dplyr)
#Get the logic index, where TRUE corresponding to the rows with all NAs
onlyNAlines_idx <- data %>%
  is.na() %>%
  apply(MARGIN = 1, FUN = all)

# Get the table without rows containing only NAs:
data[!onlyNAlines_idx,]

ADD COMMENT • link 6.4 years ago by slegotsky • 0

0

Entering edit mode

How about removing columns? This is not very clear, no need for extra packages.

ADD REPLY • link 6.4 years ago by zx8754 12k

0

Entering edit mode

6.4 years ago

slegotsky • 0

dplyr and %>% operator

I highly recommend you to look through 'dplyr' package as well as other tools from 'tidyverse'. dplyr is extremely powerful tool for cleaning and summarizing the data. It provides special '%>%' operator which pipes the output of one function into another. I can provide the following illustration to this operator:

# x - vector or other object with data. F, G, H -- functions that you would like to apply to x.
# Here is base approach, e.g. nested application:
y <- F(G(H(x)))

# Here is piping approach:
x %>%
  H() %>%
  G() %>%
  F() -> y

Both approaches give absolutely the same result but piping makes the code syntax closer to the human way of thinking: "I take x, put it in H function, then the result goes to the G function, afterwards we apply F function and finally we put the value to the y variable". %>% operator is defined in 'magrittr' package but it is extremely powerful for dataframe operations defined in 'dplyr'.

Go back to your question about the solution for columns. Here is the code:

library(dplyr)
#Get the logic index, where TRUE corresponding to the columns with all NAs
onlyNAcolumns_idx <- data %>%
   is.na() %>%
   apply(MARGIN = 2, FUN = all)

# Get the table without columns containing only NAs:
data[,!onlyNAcolumns_idx]

How to read the code which calculates onlyNAcolumns_idx:

We take 'data' object
Then we apply is.na() function. The result is the data.frame object of the same size as 'data'; it contains TRUEs and FALSEs. You have TRUEs for NA values in original 'data'.
We apply the function 'all()' for every column. This function returns TRUE if it is applied to vector with all TRUEs

That's it! the length of 'onlyNAcolumns_idx' is the same as the number of columns in data:

#You get true if you execute this:
length(onlyNAcolumns_idx) == ncol(data)

Finally, you just make logic subsetting of the dataframe. When you get in touch with dplyr the R-life will become easier. One just needs some time to adapt.

Example

x <-  matrix(seq(25), ncol = 5)
x[2,] <- NA
x[,4] <- NA
x[4,2] <- NA
x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   NA   21
[2,]   NA   NA   NA   NA   NA
[3,]    3    8   13   NA   23
[4,]    4   NA   14   NA   24
[5,]    5   10   15   NA   25

# 2nd row and 4th column contain all NAs.
# We're gonna to waste them out and retain everyother row and column.
# Treat columns:

onlyNAcolumns_idx <- x %>%
  is.na() %>%
  apply(MARGIN = 2, FUN = all)
onlyNAcolumns_idx
[1] FALSE FALSE FALSE  TRUE FALSE

( y <- x[,!onlyNAcolumns_idx] ) # NA column disappered
         [,1] [,2] [,3] [,4]
    [1,]    1    6   11   21
    [2,]   NA   NA   NA   NA
    [3,]    3    8   13   23
    [4,]    4   NA   14   24
    [5,]    5   10   15   25

# Now it's time to process the rows:
onlyNArows_idx <- y %>%
  is.na() %>%
  apply(MARGIN = 1, FUN = all)
onlyNArows_idx
[1] FALSE  TRUE FALSE FALSE FALSE

y[!onlyNArows_idx,] # NA row disappeared
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   21
[2,]    3    8   13   23
[3,]    4   NA   14   24
[4,]    5   10   15   25

ADD COMMENT • link 6.4 years ago by slegotsky • 0

3

Entering edit mode

Nothing against dplyr, but in this post it is not used at all:

%>% pipe is from magrittr package
is.na, apply, all functions are from base R

Simpler translation of your code would be:

onlyNAcolumns_idx  <- apply( is.na(x), 2, all ) # no pipes no dplyr

or better:

onlyNAcolumns_idx  <- colSums( !is.na(x) ) == 0 # avoid apply loop

Benchmarking on bigger matrix:

#bigger matrix
x <-  matrix(seq(10000), ncol = 100)
x[2,] <- NA
x[,4] <- NA
x[4,2] <- NA
x

microbenchmark::microbenchmark(
  x1 = apply( is.na(x), 2, all ),
  x2 = colSums( !is.na(x) ) == 0
)
# Unit: microseconds
# expr     min       lq      mean  median       uq      max neval
#   x1 169.058 173.9035 194.67840 176.042 179.7485 1569.411   100
#   x2  18.816  21.9515  38.45843  22.807  23.3770 1420.595   100

Base solution is 5x faster.

ADD REPLY • link 6.4 years ago by zx8754 12k

zx8754 · Accepted Answer · 2017-07-27

3

Entering edit mode

7.8 years ago

cpad0112 21k

x = matrix(seq(16), ncol = 4)
x[2,3]= NA

> x
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   NA   14
[3,]    3    7   11   15
[4,]    4    8   12   16

To remove rows and columns with NA:

> x[complete.cases(x),complete.cases(t(x))]
     [,1] [,2] [,3]
[1,]    1    5   13
[2,]    3    7   15
[3,]    4    8   16

ADD COMMENT • link 7.8 years ago by cpad0112 21k

0

Entering edit mode

how about removing rows with all NA but keep any row that contain one observation along with NA?

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.3k

5

Entering edit mode

output:

> x[rowSums( is.na(x) ) <=1, ]

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   NA   14
[3,]    3    7   11   15

input:

> x
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   NA   14
[3,]    3    7   11   15
[4,]   NA   NA   NA   NA

ADD REPLY • link updated 6.4 years ago by zx8754 12k • written 6.4 years ago by cpad0112 21k

0

Entering edit mode

this is simple as i was thinking of how to loop through the rows and columns which i haven;t figured out yet...

this error "Error: unexpected ')' in "x[rowSumsis.na(x))"" i guess rowSumis.na not a function..perhaps

i guess that was a typo...this works

" x[rowSumsis.na(x))<=1,]"

its working but one more info instead of numbers i have mutation in genes of each patient so right now i m getting an empty data frame

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.3k

1

Entering edit mode

rowSumsis.na function is not available in recent R versions (3.5). use x[rowSums( is.na(x) ) <=1, ] in place of x[rowSumsis.na(x))<=1,] . Post example data in a new post : krushnach80

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

data example

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.3k

0

Entering edit mode

what's the expected from this data? I see mostly NAs krushnach80

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

okay..so I want to remove all the genes which doesn;t contain any mutation at all in any samples, as an example NPM1 contains mutation is 50 patient so i want to keep that row, in other words keep that rows[genes] which contain at least one mutation in any patient sample

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.3k

1

Entering edit mode

use shared mutation.txt without any modifications to the file and with R 3.5:

df1=read.csv("MUTATION.txt", header = T, stringsAsFactors = F, strip.white = T, sep=",")
df2=df1[rowSums(!is.na(df1[,-1]))>=1,]

krushnach80

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

this is the mutation % data for example you can have a look at the data link below...one of the gene LSM12 1.20% is not present in df2 i was trying something similar but im missing these ...what is wrong not sure...

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.3k

0

Entering edit mode

your data on lsm12 from mutation.txt: krushnach80

LSM12,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

okay strange ...I will get back .let me see the original file again..may be some error i have made..

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.3k