Question

Binding several large matrices by column

0

Entering edit mode

9.7 years ago

yogacacarolom ▴ 70

I truly know that 'large matrix issue' is a recurrent topic here, but I would like to explain minutely my specific problem regarding large matrices.

Strictly speaking, I want to cbind several large matrices with a specific name pattern in R. The below code shows my best try until this point.

First lets produce files to mimetize my real matrices:

# df1
df1 <- '######## infx infx infx
######## infx infx infx
probeset_id sample1 sample2 sample3
PR01           1       2       0
PR02           -1      2       0
PR03            2      1       1
PR04           1       2       1
PR05           2       0       1'
df1 <- read.table(text=df1, header=T, skip=2)
write.table(df1, "df1.txt", col.names=T, row.names=F, quote=F, sep="\t")

# df2
df2 <- '######## infx infx infx
######## infx infx infx
probeset_id sample4 sample5 sample6
PR01           2       2       1
PR02           2      -1       0
PR03            2      1       1
PR04           1       2       1
PR05           0       0       1'
df2 <- read.table(text=df2, header=T, skip=2)
write.table(df2, "df2.txt", col.names=T, row.names=F, quote=F, sep="\t")

# dfn
dfn <- '######## infx infx infx
######## infx infx infx
probeset_id samplen1 samplen2 samplen3
PR01           2       -1       1
PR02           1      -1       0
PR03            2      1       1
PR04           1       2       -1
PR05           0       2       1'
dfn <- read.table(text=dfn, header=T, skip=2)
write.table(dfn, "dfn.txt", col.names=T, row.names=F, quote=F, sep="\t")

Then import it to R and write as my expected output file:

### Importing and excluding duplicated 'probeset_id' collumn
calls = list.files(pattern="*.txt")
library(data.table)
calls = lapply(calls, fread, header=T)
mycalls <- as.data.frame(calls)
probenc <- as.data.frame(mycalls[,1])
mycalls <- mycalls[, -grep("probe", colnames(mycalls))]
output <- cbind(probenc, mycalls)
names(output)[1] <- "probeset_id"
write.table(output, "output.txt", col.names=T, row.names=F, quote=F, sep="\t")

How the output looks like:

head(output)
  probeset_id sample1 sample2 sample3 sample4 sample5 sample6 samplen1 samplen2 samplen3
1        PR01       1       2       0       2       2       1        2       -1        1
2        PR02      -1       2       0       2      -1       0        1       -1        0
3        PR03       2       1       1       2       1       1        2        1        1
4        PR04       1       2       1       1       2       1        1        2       -1
5        PR05       2       0       1       0       0       1        0        2        1

This code works perfectly for what I want to do, however, I face the known R memory limitation using my real data (more than 30 "df" objects with ~1.3GB or/and 600k rows by 100 columns each).

I read about a very interesting SQL approach (http://stackoverflow.com/questions/4761073/r-how-to-rbind-two-huge-data-frames-without-running-out-of-memory) but I am inexperienced in SQL and could not find a way to adapt it to my case.

Cheers

big-data R matrix • 3.5k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.7 years ago by yogacacarolom ▴ 70

0

Entering edit mode

A big part of bioinformatics, or computer science in general, is finding the right way to look at a problem such that it is solvable with the tools we have. You've seen, the simple solution in R starts to fail when the matrices get too large. No matter how much system ram you buy, we'll always get a bigger datafile sooner or later.

Below, Pierre has suggested a file concatenation outside of R. I think it will solve your issue today. But if those matrices are taking up half of your storage capacity, then making a duplicate in the new form is not possible. Or if this is a repetitive task, that slow intermediate step might be a showstopper. As I said before, you'll always get a bigger data file. What you have to do is outsmart the problem. Think about what you're trying to do with the data and it might be possible to skip this large-matrix stage entirely.

If you want to know how many elements are greater than X, you could do that on each submatrix separately. If you want to know the row-means, you can also do those separately. I see these are genotypes, so you could stratify rows by chromosome, therefore reducing the matrix size by about 20x. Maybe you're doing an imputation that needs all columns together, but that's really a local process that could be windowed by every thousand or so rows. If you're trying to do a PCA on the whole population, try a few thousand randomly selected rows of each matrix (to get it small enough to cbind in R). The final outcome should be substantially the same in the first few dominant PCs. That sort of thing.

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.7 years ago by karl.stamm 4.1k

Ram · Answer 1 · 2015-09-22

2

Entering edit mode

9.7 years ago

Pierre Lindenbaum 166k

Just a simple bash loop with linux join?

$ sort  -k1,1n  file1.txt > sorted.txt &&  for F in file2 file3; do sort  -k1,1n ${F}.txt | join -1 1 -2 1 sorted.txt -  > tmp && mv tmp sorted.txt;  done  && cat sorted.txt

PR01 2 2 1 2 2 1 2 -1 1
PR02 2 -1 0 2 -1 0 1 -1 0
PR03 2 1 1 2 1 1 2 1 1
PR04 1 2 1 1 2 1 1 2 -1
PR05 0 0 1 0 0 1 0 2 1
probeset_id sample1 sample2 sample3 sample4 sample5 sample6 samplen1 samplen2 samplen3

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

To clarify on your Bash solution Pierre, yogacacarolom would modify the "file2 file3" portion "for F in file2 file3" to use wildcards to match all files (except file1), so she could join an arbitrarily large number of files, correct?

Also, that "cat sorted.txt" is solely to print output at the end, so if one doesn't want to print out the entire file, omit the end ("&& cat sorted.txt"), right?

ADD REPLY • link 9.7 years ago by Joseph Pearson ▴ 480

0

Entering edit mode

'yes' and 'yes' :-)

ADD REPLY • link 9.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Pierre, This works perfect and thanks for the solution. I am running in to a similar issue as mentioned in this post. It is just that I want to match the files based on chr and bp position of different files and create the matrix. So I took the liberty of tweaking your solution and here is the syntax. in the input file I am having the chr number in the first field and the bp position in the second field and the values in the third field

sort -n -k1.4n -k2,2n DNase_E003.hits.epigen.input > sorted.txt && for F in *.hits.epigen.input; do sort -n -k1.4n -k2,2n ${F} | join -1 1 -2 1 sorted.txt - >> tmp && mv tmp sorted.txt; done;

Running this script gives me the output,but printing also all the fields (chr, bp_pos) from each of the input file. Is there a way to omit this and also the syntax prompts for whether the sorted.txt can be overwritten repeatedly. So for a large list of files I am getting a larger list of prompts. Will be great to hear your input on this.

ADD REPLY • link 8.5 years ago by muralinmars ▴ 100