ste.lu · 6.6 years ago
Hi All,
I am designing a pipeline and I have a question about how to set up one of its steps. I will keep the question as general as possible because I am also interested in the math behind it and in other applications.
I have many samples that all have to go through the same process, and at the end I want to pool everything together. Which is the better approach: (a) pool the samples first and send the pooled data through the process, or (b) send each sample through the process separately and pool the results?
What are the benefits and drawbacks of each?
Thank you
Without knowing the full details of what you are trying to do, (b) would be better, since you can parallelize your processing and get through everything faster than with (a).
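To make option (b) concrete, here is a minimal Python sketch (not from the original thread) of the process-then-pool pattern; process_sample and pool_results are hypothetical placeholders for whatever the real per-sample step and pooling step are.

```python
# A minimal sketch of option (b): run the same (hypothetical) per-sample step
# in parallel, then pool the per-sample outputs at the end.
from concurrent.futures import ProcessPoolExecutor

def process_sample(sample):
    # Placeholder: trimming, alignment, quantification, etc.
    return f"result_for_{sample}"

def pool_results(per_sample_results):
    # Placeholder: merging/concatenating/aggregating the per-sample outputs.
    return list(per_sample_results)

if __name__ == "__main__":
    samples = ["sample_01", "sample_02", "sample_03"]

    # Each sample is processed independently, so the work parallelizes cleanly.
    with ProcessPoolExecutor() as executor:
        per_sample = list(executor.map(process_sample, samples))

    pooled = pool_results(per_sample)
    print(pooled)
```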
Completely agree with your answer in terms of computational power. But what about the math/statistics behind it: are the two approaches exactly equivalent, or does it always depend on the task?
Likely. If the operations you are doing are independent of each other (e.g. splitting a file of a billion sequences into 100 chunks and starting 100 alignments against the same reference genome, as opposed to one alignment job), then (b) will always be preferable/faster (as long as you have the resources available), because the resulting alignments can be merged later. But if an operation depends on the content of the whole dataset (e.g. an assembly job where a sample was sequenced on multiple lanes/flowcells), then pooling the data before starting the job is required to avoid biases. I can't comment on the theoretical implications of (a) vs. (b), but someone else may be able to.
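As an illustration of the independent-operations case described above, here is a small Python sketch of the split/process/merge idea; align_chunk is a hypothetical stand-in for a single alignment job, and the merge step is plain concatenation only because each read's result does not depend on any other read.

```python
# Sketch of the scatter/merge pattern, assuming the per-read operation is
# independent: split a large input into chunks, process the chunks in
# parallel, then merge the partial results.
from concurrent.futures import ProcessPoolExecutor

def chunk(items, n_chunks):
    # Split a list into roughly equal, contiguous chunks.
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def align_chunk(reads):
    # Placeholder for one independent per-chunk job (e.g. one alignment run).
    return [f"aligned:{read}" for read in reads]

if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(10)]

    with ProcessPoolExecutor() as executor:
        partial = list(executor.map(align_chunk, chunk(reads, 4)))

    # Merge step: because each read's result is independent, concatenating
    # the partial outputs reproduces what a single big job would have produced.
    merged = [hit for part in partial for hit in part]
    print(merged)
```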