I have a very large binary matrix, stored as a big.matrix to conserve memory (it would be over 2 GB otherwise: roughly 5 million columns and 100 rows). A smaller reproducible example:
r <- 100    # number of rows (samples)
c <- 10000  # number of columns (features); the real data has ~5 million
m4 <- matrix(sample(0:1, r * c, replace = TRUE), r, c)
m4 <- cbind(m4, 1)  # append a constant column so there is something to remove
m4 <- bigmemory::as.big.matrix(m4)
I need to remove every column which has only one unique value (in this case, only 0s or only 1s). Because of the number of columns, I want to be able to do this in parallel.
How can I accomplish this while keeping the data stored as a big.matrix? I can convert it to a data.frame and loop over the columns counting unique values, but that takes too much RAM.
Thanks!
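One possible approach (an untested sketch, not a definitive solution): since the data are binary, a column is constant exactly when its column sum is 0 or nrow(m4). The sketch below assumes the big.matrix is shared (the default for as.big.matrix), so parallel workers can re-attach it via describe()/attach.big.matrix(); the two-worker cluster and the chunking are arbitrary illustrative choices.

library(bigmemory)
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

desc <- describe(m4)                                 # shareable descriptor of the big.matrix
cols <- seq_len(ncol(m4))
chunks <- split(cols, cut(cols, 2, labels = FALSE))  # one block of column indices per worker

keep <- foreach(idx = chunks, .combine = c, .packages = "bigmemory") %dopar% {
  x <- attach.big.matrix(desc)              # re-attach the shared matrix inside the worker
  s <- colSums(x[, idx, drop = FALSE])      # column sums of this chunk (0/1 data)
  idx[s > 0 & s < nrow(x)]                  # keep columns that are neither all 0 nor all 1
}
stopCluster(cl)

m5 <- bigmemory::as.big.matrix(m4[, sort(keep), drop = FALSE])  # filtered copy

Note that each chunk is pulled into ordinary RAM while it is scanned, and the final subsetting line materializes the kept columns in RAM before converting back; for the full 5-million-column matrix, bigmemory::deepcopy() with its cols argument may be preferable, though I have not verified that here.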
EDIT: It is bioinformatics, as each column is actually a protein subsequence. I am running Fisher's exact test to select important features, but before that I must remove features that are present in all samples.
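For illustration, a minimal sketch of that downstream test, using a hypothetical binary case/control label vector y (invented here purely for the example):

y <- sample(0:1, nrow(m4), replace = TRUE)  # hypothetical sample labels
feature <- m4[, 1]                          # one protein-subsequence feature
fisher.test(table(feature, y))              # 2x2 contingency-table test
# A constant feature would give a 1x2 table, which fisher.test rejects;
# hence the need to filter such columns first.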
This is purely an R question. How is it bioinformatics?
Hello jackarnestad!
We believe that this post does not fit the main topic of this site. For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.
Please tell us how this is related to bioinformatics and we will reopen the question. If you disagree, please tell us why in a reply below; we'll be happy to talk about it.
Cheers!
I addressed the bioinformatics aspect in my edit. Thanks!
Thanks for clarifying. This is indeed a question applied to bioinformatics, but R questions like this might get a quicker answer on Bioconductor support or Stack Overflow. You may still be lucky and find someone here who can help, so let's wait a bit before cross-posting...
Could you include in your code the package where big.matrix is defined?
Added it to the code: bigmemory.