Question

Batch effects from sequencing samples across multiple flow cells.

3

4.3 years ago

Mat ▴ 80

I would like to have bulk RNA sequencing of around 50-100 samples performed on a NovaSeq 6000. I am not sure how to judge potential batch effects if the samples are distributed across multiple flow cells (e.g. 20 samples per flow cell, sharing each flow cell with samples from other studies) versus running all samples on a single "exclusive" flow cell. In general, I would estimate batch effects from using different flow cells to be rather low. Does anyone have experience with this or, even better, can anyone recommend a paper?

rna-seq batch batch-effect sequencing • 2.5k views

ADD COMMENT • link updated 16 months ago by Ram 45k • written 4.3 years ago by Mat ▴ 80

2

Hello, I cannot link a paper (others may), but my feeling based on the discussions here and elsewhere is that there is a general consensus that batch effects from flow cells, and even from different Illumina machines (at least the more recent ones), are minimal to basically non-existent. The true technical variation arises before sequencing (RNA extraction, library prep kits, presence of contaminants, RNA degradation), whereas the sequencing itself starts from the final DNA library (notably more robust than RNA, with little chance of contaminant-based degradation) and is extremely standardized based on the Illumina guidelines. I am sure you are aware of how to diagnose all this post hoc using PCA (or similar), but as said I would not expect any relevant batch effect. It was and is standard to split runs over different lanes or flow cells, especially in the era before the NovaSeq came out, when outputs per run were smaller.
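For readers who have not run the post hoc check mentioned above, here is a minimal sketch of a PCA on log-transformed counts, grouped by flow cell. It assumes NumPy is available; the function name `pca_scores`, the log2(CPM + 1) transform, and the simulated toy counts are illustrative choices, not something prescribed in this thread. In practice you would plot PC1 against PC2 and colour the points by flow cell.

```python
import numpy as np

def pca_scores(counts, n_components=2):
    """Return per-sample PCA scores and explained-variance ratios.

    counts: array of shape (n_samples, n_genes) holding raw read counts.
    """
    counts = np.asarray(counts, dtype=float)
    # Library-size normalise to counts per million, then log-transform.
    lib_size = counts.sum(axis=1, keepdims=True)
    logcpm = np.log2(counts / lib_size * 1e6 + 1.0)
    # Centre each gene so the PCA describes variation between samples.
    centred = logcpm - logcpm.mean(axis=0)
    # SVD of the centred matrix yields the principal components.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Simulated toy data (not real measurements): 12 samples, 200 genes, 2 flow cells.
rng = np.random.default_rng(0)
flowcell = np.array(["FC1", "FC2"] * 6)
toy_counts = rng.poisson(lam=50, size=(12, 200))

scores, explained = pca_scores(toy_counts)
for fc in np.unique(flowcell):
    # Mean PC1 position per flow cell; a large gap relative to the
    # within-group spread would hint at a flow-cell batch effect.
    print(fc, round(float(scores[flowcell == fc, 0].mean()), 3))
```

If samples separate by flow cell rather than by biological condition on the leading components, that is worth a closer look; if they do not, the flow cell is unlikely to matter much.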

You could also pool all libraries into a single tube and then run the pool across as many lanes/flowcells as you need to achieve the desired depth, so a potential batch effect would not confound certain samples.
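To illustrate why the single pool helps, here is a small sketch that compares two layouts and checks whether any flow cell is missing a condition. The sample sheet, the condition labels, and the helper names `conditions_per_flowcell` and `is_confounded` are all hypothetical and only serve the illustration.

```python
from collections import defaultdict

def conditions_per_flowcell(layout):
    """Map each flow cell to the set of conditions sequenced on it.

    layout: iterable of (sample, condition, flowcell) tuples.
    """
    seen = defaultdict(set)
    for _sample, condition, flowcell in layout:
        seen[flowcell].add(condition)
    return dict(seen)

def is_confounded(layout):
    """True if some flow cell lacks a condition that appears elsewhere."""
    per_fc = conditions_per_flowcell(layout)
    all_conditions = set().union(*per_fc.values())
    return any(conds != all_conditions for conds in per_fc.values())

# Hypothetical sample sheet: 4 samples, 2 conditions.
samples = [("s1", "ctrl"), ("s2", "ctrl"), ("s3", "treated"), ("s4", "treated")]

# Split design: each sample sits on exactly one flow cell, grouped by condition.
split = [(s, c, "FC1" if c == "ctrl" else "FC2") for s, c in samples]

# Pooled design: the one pool is loaded on every flow cell,
# so every sample contributes reads to both.
pooled = [(s, c, fc) for s, c in samples for fc in ("FC1", "FC2")]

print("split design confounded:", is_confounded(split))
print("pooled design confounded:", is_confounded(pooled))
```

With the pooled layout, flow cell is crossed with sample rather than nested within condition, so any flow-cell effect hits all samples equally.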

ADD REPLY • link 4.3 years ago by ATpoint 89k

2

Follow ATpoint's suggestion:

You could also pool all libraries into a single tube and then run the pool across as many lanes/flowcells as you need to achieve the desired depth, so a potential batch effect would not confound certain samples.

Where possible, use the largest flowcell, e.g. S4. They are more economical anyway. For 100 samples x 30 million reads each, you are looking at no more than 2 lanes on an S4 flowcell (or an S2 FC) for human samples. If you have a smaller genome, you will need fewer reads. Be sure to ask your provider to prep all the libraries at the same time to reduce any batch effects there. At 100 samples they will likely be using a robot. Doing a test MiSeq Nano run to balance the library pool is well worth the small extra cost.
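The lane estimate above can be checked with a quick back-of-the-envelope calculation. The per-lane yield of 2.5 billion reads is an assumed nominal figure for an S4 lane (roughly 10 billion reads over 4 lanes); real yield varies by run, so confirm the number with your provider. The helper name `lanes_needed` is hypothetical.

```python
import math

def lanes_needed(n_samples, reads_per_sample, reads_per_lane):
    """Return (total reads required, whole lanes needed to reach them)."""
    total = n_samples * reads_per_sample
    return total, math.ceil(total / reads_per_lane)

# Assumption: nominal S4 lane yield, not a guaranteed figure.
S4_READS_PER_LANE = 2_500_000_000

total, lanes = lanes_needed(100, 30_000_000, S4_READS_PER_LANE)
print(f"{total:,} reads needed -> {lanes} S4 lane(s)")
```

Under that assumption, 100 samples at 30 million reads each need 3 billion reads in total, which fits in 2 S4 lanes with headroom to spare.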

ADD REPLY • link 4.3 years ago by GenoMax 153k