Hi,
I'm beginner in bioinformatics as PhD candidate in Japan. I am trying to do an integrating bulk RNA-seq analysis using some public transcriptomic data from multiple facilities, but I'm struggling with batch effect correction. When I try to draw heatmap using TPM value, the result shows different gene expression patterns only for samples at a particular facility.
Indeed, these samples were sequenced using a different sequencer at a different facility. Also, these samples were sequenced as paired-ended while the others were single-ended.
Then, I have corrected normalized TPM data by using Combat-seq, but the PCA plots still present artifacts that are clearly due to inter-facility differences.
I would be grateful if someone could give me some advice.
Best wishes
Thank you for your reply.
I know there are a number of papers reporting integrated RNA-seq analysis using different data sources, but is it still not scientifically valid?
It would depend on the extent of the batch effect and the nature of the measures taken to correct for it. Do you have papers in mind that use a similar design to what you are hoping to accomplish?
Studies such as https://www.science.org/doi/10.1126/science.aat8127 explicitly include batch as a covariate when testing for DE; with both cases and controls present in every batch. This corrects for any mean differences between batches. For visualization, this model (less the binary outcome of interest) is used as a regression for each gene; and the residuals are visualized. Note that in cases where there is stratification between outcome and batch (i.e., differences in group proportions between batches), some of the group effects will load onto the batch effect, and be removed by this approach.
Dear Dr LChart,
Thank you for your helpful suggestion. Could you let me know more about "Studies such as https://www.science.org/doi/10.1126/science.aat8127 explicitly include batch as a covariate when testing for DE"
Is it different from batch correction like Combat-seq ?
BW
They include a batch variable in their regression model model like
expression ~ disease + BATCH + [other covariates]...
Thank you really, doctor. So, when they calculate expression ratios (m-value, p-value) from expression value such as TPM, they use a regression model that includes the batch as a covariate?
I want to know how to define the batch covariate here, as I could not fine the specific calculation formula in the paper above.
Or may I ask for your direct contact if you would not mind.
Best,
You mention seeing stratification by facility. If you're performing the analysis in R, you can use a variable that stores the facility names such as:
(etc); and you could regress
expr ~ group + sex + facility + 0
to see group effects controlling for sex and facility effects.