ste.lu · 6.6 years ago
Hi All,
I am designing a pipeline and I have a question about how to set up one of its steps. I will keep the question as general as possible because I am also interested in the math behind it and in other applications.
I have many samples that all have to go through the same process, and at the end I want to pool everything together. Which is the better approach: (a) pool the samples first and send the pooled data through the process, or (b) send each sample through the process separately and pool the results?
What are the benefits and drawbacks of each?
Thank you
Without knowing the full details of what you are trying to do, (b) would be better, since you can parallelize your processing and get through everything faster than with (a).
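To make option (b) concrete, here is a minimal Python sketch (not from the original thread) of the process-then-pool pattern; process_sample and pool_results are hypothetical placeholders for whatever the real per-sample step and pooling step are.

```python
# A minimal sketch of option (b): run the same (hypothetical) per-sample step
# in parallel, then pool the per-sample outputs at the end.
from concurrent.futures import ProcessPoolExecutor

def process_sample(sample):
    # Placeholder: trimming, alignment, quantification, etc.
    return f"result_for_{sample}"

def pool_results(per_sample_results):
    # Placeholder: merging/concatenating/aggregating the per-sample outputs.
    return list(per_sample_results)

if __name__ == "__main__":
    samples = ["sample_01", "sample_02", "sample_03"]

    # Each sample is processed independently, so the work parallelizes cleanly.
    with ProcessPoolExecutor() as executor:
        per_sample = list(executor.map(process_sample, samples))

    pooled = pool_results(per_sample)
    print(pooled)
```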
Completely agree with your answer in terms of computational power. But what about the math/statistics behind it: are the two approaches exactly equivalent, or does it always depend on the task?
Likely. If the operations you are doing are independent of each other (e.g. splitting a file of a billion sequences into 100 chunks and starting 100 alignments against the same reference genome, as opposed to one alignment job), then (b) will always be preferable/faster (as long as you have the resources available), because the resulting alignments can be merged later. But if an operation depends on the content of the whole dataset (e.g. an assembly job where a sample was sequenced on multiple lanes/flowcells), then pooling the data before starting the job is required to avoid biases. I can't comment on the theoretical implications of (a) vs. (b), but someone else may be able to.
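As an illustration of the independent-operations case described above, here is a small Python sketch of the split/process/merge idea; align_chunk is a hypothetical stand-in for a single alignment job, and the merge step is plain concatenation only because each read's result does not depend on any other read.

```python
# Sketch of the scatter/merge pattern, assuming the per-read operation is
# independent: split a large input into chunks, process the chunks in
# parallel, then merge the partial results.
from concurrent.futures import ProcessPoolExecutor

def chunk(items, n_chunks):
    # Split a list into roughly equal, contiguous chunks.
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def align_chunk(reads):
    # Placeholder for one independent per-chunk job (e.g. one alignment run).
    return [f"aligned:{read}" for read in reads]

if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(10)]

    with ProcessPoolExecutor() as executor:
        partial = list(executor.map(align_chunk, chunk(reads, 4)))

    # Merge step: because each read's result is independent, concatenating
    # the partial outputs reproduces what a single big job would have produced.
    merged = [hit for part in partial for hit in part]
    print(merged)
```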