I know the 'large matrix issue' is a recurring topic here, but I would like to explain my specific problem with large matrices in detail.
Strictly speaking, I want to cbind several large matrices with a specific name pattern in R. The code below shows my best attempt so far.
First, let's produce some files that mimic my real matrices:
# df1
df1 <- '######## infx infx infx
######## infx infx infx
probeset_id sample1 sample2 sample3
PR01 1 2 0
PR02 -1 2 0
PR03 2 1 1
PR04 1 2 1
PR05 2 0 1'
df1 <- read.table(text=df1, header=T, skip=2)
write.table(df1, "df1.txt", col.names=T, row.names=F, quote=F, sep="\t")
# df2
df2 <- '######## infx infx infx
######## infx infx infx
probeset_id sample4 sample5 sample6
PR01 2 2 1
PR02 2 -1 0
PR03 2 1 1
PR04 1 2 1
PR05 0 0 1'
df2 <- read.table(text=df2, header=T, skip=2)
write.table(df2, "df2.txt", col.names=T, row.names=F, quote=F, sep="\t")
# dfn
dfn <- '######## infx infx infx
######## infx infx infx
probeset_id samplen1 samplen2 samplen3
PR01 2 -1 1
PR02 1 -1 0
PR03 2 1 1
PR04 1 2 -1
PR05 0 2 1'
dfn <- read.table(text=dfn, header=T, skip=2)
write.table(dfn, "dfn.txt", col.names=T, row.names=F, quote=F, sep="\t")
Then import them into R and write my expected output file:
### Importing and excluding the duplicated 'probeset_id' columns
calls <- list.files(pattern="\\.txt$")   # the pattern argument is a regex, not a glob
library(data.table)
calls = lapply(calls, fread, header=T)
mycalls <- as.data.frame(calls)
probenc <- as.data.frame(mycalls[,1])
mycalls <- mycalls[, -grep("probe", colnames(mycalls))]
output <- cbind(probenc, mycalls)
names(output)[1] <- "probeset_id"
write.table(output, "output.txt", col.names=T, row.names=F, quote=F, sep="\t")
This is how the output looks:
head(output)
probeset_id sample1 sample2 sample3 sample4 sample5 sample6 samplen1 samplen2 samplen3
1 PR01 1 2 0 2 2 1 2 -1 1
2 PR02 -1 2 0 2 -1 0 1 -1 0
3 PR03 2 1 1 2 1 1 2 1 1
4 PR04 1 2 1 1 2 1 1 2 -1
5 PR05 2 0 1 0 0 1 0 2 1
This code works perfectly for what I want to do; however, with my real data (more than 30 "df" objects, each ~1.3 GB and roughly 600k rows by 100 columns) I run into the well-known R memory limitation.
I read about a very interesting SQL approach (http://stackoverflow.com/questions/4761073/r-how-to-rbind-two-huge-data-frames-without-running-out-of-memory) but I am inexperienced in SQL and could not find a way to adapt it to my case.
Cheers
A big part of bioinformatics, or computer science in general, is finding the right way to look at a problem so that it becomes solvable with the tools we have. As you've seen, the simple solution in R starts to fail when the matrices get too large. No matter how much system RAM you buy, a bigger data file will come along sooner or later.
Below, Pierre has suggested concatenating the files outside of R (a chunked, stay-in-R variant is sketched after this paragraph). I think that will solve your issue today. But if those matrices already take up half of your storage capacity, making a duplicate in the new form is not possible. And if this is a repetitive task, that slow intermediate step might be a showstopper. As I said, you'll always get a bigger data file eventually. What you have to do is outsmart the problem: think about what you are actually trying to do with the data, and it may be possible to skip the large-matrix stage entirely.
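If you would rather stay inside R, roughly the same column-wise concatenation can be done in row chunks, so that no full matrix is ever held in memory. This is only a minimal sketch, not Pierre's actual command: it assumes every file has the same probesets in the same row order, and the ^df.*\.txt$ name pattern and 10,000-line chunk size are placeholders you would adjust.
# Chunked column-bind of tab-delimited files; only one chunk per file
# is in memory at any time. Assumes identical row order across files.
files <- list.files(pattern = "^df.*\\.txt$")   # placeholder pattern
cons  <- lapply(files, file, open = "r")
out   <- file("output.txt", open = "w")
repeat {
  chunks <- lapply(cons, readLines, n = 10000)
  if (length(chunks[[1]]) == 0) break
  # keep probeset_id only from the first file; strip it from the others
  rest <- lapply(chunks[-1], function(x) sub("^[^\t]*\t", "", x))
  writeLines(do.call(paste, c(chunks[1], rest, sep = "\t")), out)
}
invisible(lapply(c(cons, list(out)), close))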
As for skipping the large-matrix stage: if you want to know how many elements are greater than some X, you can count them in each submatrix separately. If you want row means, you can also accumulate those file by file. I see these are genotypes, so you could stratify the rows by chromosome and cut the matrix size by roughly 20x. Maybe you're doing an imputation that needs all columns together, but that is really a local process that could be windowed over every thousand or so rows. If you're trying to do a PCA on the whole population, try a few thousand randomly selected rows from each matrix (small enough to cbind in R); the first few dominant PCs should come out substantially the same. That sort of thing — a rough sketch of a couple of these per-submatrix ideas follows.
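Here is a minimal sketch of the per-submatrix idea, using the toy dfN.txt files from the question. The threshold X, the number of rows kept for the PCA, and the ^df.*\.txt$ pattern are placeholders, not anything from your real pipeline.
library(data.table)
files <- list.files(pattern = "^df.*\\.txt$")   # placeholder pattern

# 1) How many elements are greater than X: count per file, then sum.
X <- 1
n_gt_X <- sum(sapply(files, function(f)
  sum(as.matrix(fread(f, drop = "probeset_id")) > X)))

# 2) Row means across all samples: accumulate row sums and column counts
#    one file at a time, never holding more than one matrix in memory.
row_sums <- NULL
n_cols   <- 0
for (f in files) {
  m <- as.matrix(fread(f, drop = "probeset_id"))
  row_sums <- if (is.null(row_sums)) rowSums(m) else row_sums + rowSums(m)
  n_cols   <- n_cols + ncol(m)
}
row_means <- row_sums / n_cols

# 3) PCA on a random subset of rows: take the same rows from every file,
#    cbind only that thin slice, and run prcomp with samples as observations.
set.seed(1)
keep  <- sort(sample(length(row_means), min(1000, length(row_means))))
slice <- do.call(cbind, lapply(files, function(f)
  as.matrix(fread(f, drop = "probeset_id"))[keep, , drop = FALSE]))
pca <- prcomp(t(slice))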