Question

Batch effect in integrated RNA-seq analysis

0

Entering edit mode

21 months ago

Sakura • 0

Hi,

I'm beginner in bioinformatics as PhD candidate in Japan. I am trying to do an integrating bulk RNA-seq analysis using some public transcriptomic data from multiple facilities, but I'm struggling with batch effect correction. When I try to draw heatmap using TPM value, the result shows different gene expression patterns only for samples at a particular facility.

Indeed, these samples were sequenced using a different sequencer at a different facility. Also, these samples were sequenced as paired-ended while the others were single-ended.

Then, I have corrected normalized TPM data by using Combat-seq, but the PCA plots still present artifacts that are clearly due to inter-facility differences.

I would be grateful if someone could give me some advice.

Best wishes

batch-effect RNA-seq Combat • 4.2k views

ADD COMMENT • link updated 20 months ago by LChart 4.5k • written 21 months ago by Sakura • 0

score 2 · Answer 1 · 2023-03-01

2

Entering edit mode

21 months ago

swbarnes2 14k

RNASeq is very sensitive to batch effects. It is generally not possible to take experiments from different sources and combine them. Batch differences will always obscure, if not blot out, biological differences.

ADD COMMENT • link 21 months ago by swbarnes2 14k

0

Entering edit mode

Thank you for your reply.

I know there are a number of papers reporting integrated RNA-seq analysis using different data sources, but is it still not scientifically valid?

ADD REPLY • link 21 months ago by Sakura • 0

0

Entering edit mode

It would depend on the extent of the batch effect and the nature of the measures taken to correct for it. Do you have papers in mind that use a similar design to what you are hoping to accomplish?

Studies such as https://www.science.org/doi/10.1126/science.aat8127 explicitly include batch as a covariate when testing for DE; with both cases and controls present in every batch. This corrects for any mean differences between batches. For visualization, this model (less the binary outcome of interest) is used as a regression for each gene; and the residuals are visualized. Note that in cases where there is stratification between outcome and batch (i.e., differences in group proportions between batches), some of the group effects will load onto the batch effect, and be removed by this approach.

ADD REPLY • link 21 months ago by LChart 4.5k

0

Entering edit mode

Dear Dr LChart,

Thank you for your helpful suggestion. Could you let me know more about "Studies such as https://www.science.org/doi/10.1126/science.aat8127 explicitly include batch as a covariate when testing for DE"

Is it different from batch correction like Combat-seq ?

BW

ADD REPLY • link 20 months ago by Sakura • 0

1

Entering edit mode

They include a batch variable in their regression model model like

expression ~ disease + BATCH + [other covariates]...

ADD REPLY • link 20 months ago by LChart 4.5k

0

Entering edit mode

Thank you really, doctor. So, when they calculate expression ratios (m-value, p-value) from expression value such as TPM, they use a regression model that includes the batch as a covariate?

I want to know how to define the batch covariate here, as I could not fine the specific calculation formula in the paper above.

Or may I ask for your direct contact if you would not mind.

Best,

ADD REPLY • link 20 months ago by Sakura • 0

0

Entering edit mode

You mention seeing stratification by facility. If you're performing the analysis in R, you can use a variable that stores the facility names such as:

sample_number    sex   facility        group
            1      M      lab_A      Treated
            2      M      lab_A      Control
            3      F      lab_B      Control
            4      F      lab_C      Treated

(etc); and you could regress expr ~ group + sex + facility + 0

to see group effects controlling for sex and facility effects.

ADD REPLY • link 20 months ago by LChart 4.5k

score 0 · Answer 2 · 2023-03-02

0

Entering edit mode

21 months ago

John Ma ▴ 310

You mean bulk, or single-cell? The rule of thumb is for single-cell, there'd be batch effects between runs of the microfluidics instrument. For that, I usually use scvi-tools, mostly because I frequently analyse scRNA-seq + CITE-seq data. Seurat also does this using different algorithms.