Hi!
As a biologist by training who switched to bioinformatics, I first learned to program in R. Then I picked up bits of Unix and Python for some very specific tasks. So far I have managed to deal quickly with everything I was asked to do using only those.
Unfortunately, I am now facing a new issue: really big datasets. By this I mean vectors with hundreds of millions of entries, or data frames that take several minutes to load in R and sometimes hours just to tidy.
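For context, here is a stripped-down sketch of the kind of workflow I mean (the file and column names are made up, but the steps are typical):

```r
# Loading alone can take several minutes on a file with ~10^8 rows:
df <- read.csv("variants.csv")  # base R reader, single-threaded

# ...followed by tidying steps that can run for hours at this scale, e.g.:
df <- df[!is.na(df$position), ]                    # drop rows with missing positions
df$id <- paste(df$chrom, df$position, sep = ":")   # build a per-row lookup key
```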
So my question, for more experienced bioinformaticians and programmers, is: how do you usually deal with really large datasets?
- What are the "best/most useful" languages for these tasks? (I assume R is probably not the best option.)
- What are the common tactics? (Hash tables, ...? See my toy example below this list.)
- Do you have any tricks to share?
- Can you recommend any books or MOOCs?
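To make the hash-table question concrete, here is a toy sketch of what I think is meant, using an R environment as a hash table (the gene IDs and lengths below are made up, just for illustration):

```r
# An environment with hashing enabled gives O(1) key lookups,
# instead of scanning a whole vector with which()/match():
h <- new.env(hash = TRUE)

# store values keyed by gene ID
assign("BRCA1", 81189L, envir = h)
assign("TP53",  25772L, envir = h)

# constant-time lookup by key
get("TP53", envir = h)                     # 25772
exists("MYC", envir = h, inherits = FALSE) # FALSE
```

Is this the sort of tactic you have in mind, or are there better-suited data structures for data this size?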
Since my background is not in programming, I am really looking for general advice on this question ;-).
Thanks in advance for your answers!
Edit: Thank you all for your answers! They were really helpful!