Hi!
As a biologist by training who switched to bioinformatics, I first learned to program in R. Then I picked up bits of Unix and Python for some very specific tasks. So far I have managed to deal quickly with everything I was asked to do using only those.
Unfortunately, I am now facing a new issue: really big datasets. By this I mean vectors with hundreds of millions of entries, or data frames that take several minutes to load in R and sometimes hours just to tidy.
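For context, here is a stripped-down sketch of the kind of workflow I mean (the file and column names are made up, but the steps are typical):

```r
# Loading alone can take several minutes on a file with ~10^8 rows:
df <- read.csv("variants.csv")  # base R reader, single-threaded

# ...followed by tidying steps that can run for hours at this scale, e.g.:
df <- df[!is.na(df$position), ]                    # drop rows with missing positions
df$id <- paste(df$chrom, df$position, sep = ":")   # build a per-row lookup key
```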
So my question, for more experienced bioinformaticians and programmers, is: how do you usually deal with really large datasets?
- What are the "best/most useful" languages for these tasks? (I assume R is probably not the best option.)
- What are the common tactics? (Hash tables, ...? See my toy example below this list.)
- Do you have any tricks to share?
- Can you recommend any books or MOOCs?
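To make the hash-table question concrete, here is a toy sketch of what I think is meant, using an R environment as a hash table (the gene IDs and lengths below are made up, just for illustration):

```r
# An environment with hashing enabled gives O(1) key lookups,
# instead of scanning a whole vector with which()/match():
h <- new.env(hash = TRUE)

# store values keyed by gene ID
assign("BRCA1", 81189L, envir = h)
assign("TP53",  25772L, envir = h)

# constant-time lookup by key
get("TP53", envir = h)                     # 25772
exists("MYC", envir = h, inherits = FALSE) # FALSE
```

Is this the sort of tactic you have in mind, or are there better-suited data structures for data this size?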
Since my background is not in programming, I am really looking for general advice on this question ;-).
Thanks in advance for your answers!
Edit: Thank you all for your answers! They were really helpful!