castravete2712 • 4.8 years ago
Hello there,
I'm struggling to load/import my large count matrix into RStudio in order to analyze it. The data is only medium-sized (3 GB), but R crashes every time and my PC wants to commit seppuku each time.
So a plain read.table() won't work. I also tried storing it as a big.matrix, but that doesn't work either; R crashes again.
What can I do? I can't find any nice tutorial for this kind of problem.
Is it crashing due to running out of RAM? Have you tried a sparse matrix?
Yep, the RAM can't keep up.
I was considering that, but I can't find a way to read the file directly into a sparseMatrix while avoiding the read.table step. read.matrix, maybe?
How about doing that sequentially, like in chunks of 10%?
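A minimal sketch of the chunked approach, assuming a tab-delimited file (the name counts.txt, the chunk size, and the convention that the first column holds gene IDs are all assumptions; adjust to your data). Each chunk is converted to a sparse matrix immediately, so the full dense matrix never sits in RAM at once:

```r
library(data.table)
library(Matrix)

f <- "counts.txt"                      # hypothetical path
header <- names(fread(f, nrows = 0))   # read only the header line
chunk_size <- 50000L
chunks <- list()
skip <- 1L                             # skip the header on each pass
repeat {
  dt <- fread(f, skip = skip, nrows = chunk_size, header = FALSE)
  if (nrow(dt) == 0) break
  m <- as.matrix(dt[, -1])             # assumes column 1 = gene names
  rownames(m) <- dt[[1]]
  chunks[[length(chunks) + 1]] <- Matrix(m, sparse = TRUE)
  skip <- skip + nrow(dt)
  if (nrow(dt) < chunk_size) break
}
counts <- do.call(rbind, chunks)       # stack sparse chunks row-wise
colnames(counts) <- header[-1]
```

Only one dense chunk exists in memory at any time; whether this fits depends on chunk_size and the number of columns.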
By the way, don't bother with read.table(), it is super slow. Use, for example (among many good options), data.table::fread() or readr::read_delim(). The speed gains are notable.
Might be a job for {disk.frame}: https://github.com/xiaodaigh/disk.frame
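If the whole file does fit in RAM once (and only the downstream analysis blows up), a one-shot read with fread followed by immediate conversion to a sparse matrix is the simplest route. A sketch, assuming a tab-delimited file named counts.txt whose first column holds gene IDs (both assumptions):

```r
library(data.table)
library(Matrix)

dt <- fread("counts.txt")            # hypothetical path; fast multithreaded read
m  <- as.matrix(dt, rownames = 1)    # first column becomes rownames
counts <- Matrix(m, sparse = TRUE)   # zeros are dropped from storage
rm(dt, m); gc()                      # free the dense copies
```

For a typical single-cell count matrix, which is mostly zeros, the sparse form can be an order of magnitude smaller than the dense one.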
Do you mean "load/import" a big file?
Yeah, sorry, I did indeed mean importing/loading the data.
I struggle to see how this is related to bioinformatics, or why it has attracted so many answers. Loads of questions get killed for asking something about biology and maybe tangentially related to bioinformatics. I don't see how this question is related to either.
Dealing with large data sets has become a more common issue although it is not specific to bioinformatics. However, for bioinformatics data types, there may exist specific tools. Here we're dealing with a count matrix and although replies currently suggest generic solutions, maybe someone has a more specific solution for count matrices as part of their analysis pipeline that they can share.
Agreed. The single-cell packages are starting to output counts in sparse matrices inside hdf5 containers for this reason, so if one could go back a step in OP's workflow there are likely some tweaks that could be made there to make life easier.
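Along those lines: if the upstream pipeline can emit (or already emits) a MatrixMarket triplet file, as the 10x tools do, Matrix::readMM() loads it straight into a sparse matrix without a dense intermediate. A sketch, with hypothetical file names:

```r
library(Matrix)

# Assumed output layout: matrix.mtx plus one-ID-per-line
# features/barcodes files, as produced by 10x-style pipelines.
counts <- readMM("matrix.mtx")             # sparse from the start
rownames(counts) <- readLines("features.tsv")
colnames(counts) <- readLines("barcodes.tsv")
```

Checking whether the split-seq pipeline has such an export option may be easier than fighting a dense text dump after the fact.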
Maybe I am missing your point. Do you see anything in the question that implies biology or bioinformatics application of what this poster is trying to do?
My point was that lots of posters are turned away even though they sometimes have legitimate biology questions that may be related to bioinformatics. To me, those are closer to the intended purpose of this site than the current post.
Your point is valid, but as it does not add to the content of this thread I suggest we discuss things like that in our Slack, which you are invited to join:
biostar.slack.com: Chat for the biostars community -- [ feel free to join ]
Well, since it wasn't really important to say why I needed it, I didn't mention it. But I need to find out how to import this complex and large data into R, because I need to analyze a large count matrix produced by single-cell sequencing (split-seq, to be precise).
The count matrices produced by the pipelines that analyze single-cell raw data are huge, and as a result R struggles to work with them and needs a lot of RAM.
So I was just asking around, as I can't really find the right method for using R with such large files, and I want to do it properly.