5.7 years ago
Sebastian Hesse
Working with a proteome dataset, I would like to check and correct for batch effects. For batch effect detection I am using PVCA; for batch effect correction, limma's removeBatchEffect. The data come from DIA proteome analysis, quantified using Biognosys Spectronaut 11; details can be found in our first paper about it. Quantitative data were log2-transformed before analysis and correction.
Using PVCA I get the following result for my uncorrected data:
I have read that for batch correction to be valid, data must be well balanced. But what exactly does this mean?
- I compare proteomes of patients with different disease genotypes to healthy controls. Since I have many (12) different genotypes, including many unknowns, I suspect I will need to use only disease vs. no disease as the status to be protected? In total I have 70 healthy samples and 70 patients, but split by genotype the patient group would be widely dispersed, with many singletons and many unknowns (which definitely do not all share the same genotype).
- Is it important that the factors I correct for are evenly distributed individually, or also in combination if I want to correct for several? E.g.: my samples are well balanced for date processed and also for cell number (meaning that in both cases I have roughly equal numbers of healthy and patient samples). But if I check the distribution using date AND cell number, the balance is quite off (with a few combinations having no patients or no healthy samples). So is it required that both are balanced together, or is it fine as long as each factor is balanced on its own?
- Is it fine to use extreme correction measures (e.g. date.processed, protease.inhibitor, cell number and age) for data that are used for visualisation and clustering, while in limma blocking only for date_processed (and protease inhibitor)? When blocking in limma, is it important to check again for a well-balanced distribution of all factors in combination? (I suspect yes, and that I will need to pick one factor to correct for and leave the rest as they are.)
Thanks a lot for your suggestions!
Sebastian
Disclaimer first: I've never used that package, so I may be missing something that's obvious to you.
Regarding some of your questions:
Generally, unless you have at least two samples per unique group, there's no point because you have no way of estimating the variability within the group. limma can handle unbalanced designs, though (but will balk if there are fewer than two in a specific group).
Actually, limma does not require >= 2 samples in each group. A specific group of size n=1 is no problem. limma requires at least two samples in at least one group, not at least two samples in all groups.
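To make that rule concrete: assuming a one-way design with one coefficient per group, the residual degrees of freedom are the number of samples minus the number of groups, and the variance can only be estimated when that number is at least 1. A minimal Python sketch of the arithmetic follows; the function name and the group labels are illustrative and not part of limma.

```python
from collections import Counter


def residual_df(groups):
    """Residual degrees of freedom for a one-way layout: n samples minus k groups.

    Assumes one coefficient per group and no other covariates.
    """
    counts = Counter(groups)
    return len(groups) - len(counts)


# Two genotypes with a single patient each, plus a replicated healthy group:
# variability can still be estimated from the replicated group.
labels = ["healthy", "healthy", "healthy", "genotypeA", "genotypeB"]
print(residual_df(labels))  # 2

# Every group has size one: nothing is left to estimate variability from.
print(residual_df(["healthy", "genotypeA", "genotypeB"]))  # 0
```

Any extra covariate or batch term in the design would use up further degrees of freedom, so the real number can be smaller than this sketch suggests.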
Batch correction is a different matter of course. Batches can't be confounded with the factors of interest.
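One simple way to look for confounding before correcting is to tabulate which statuses occur in each batch, where a "batch" can be a combination of several factors. The sketch below is only an illustration in Python with made-up sample annotations; the function name and the factor levels are hypothetical. It flags batch levels that contain a single status. If every batch is flagged, batch and status are completely confounded; if only some are, those batches carry no information for separating the batch effect from the status effect.

```python
from collections import defaultdict


def single_status_batches(batches, statuses):
    """Return the batch levels whose samples all share one status."""
    seen = defaultdict(set)
    for batch, status in zip(batches, statuses):
        seen[batch].add(status)
    return sorted(b for b, found in seen.items() if len(found) < 2)


# Hypothetical annotations: a batch is (date processed, cell-number bin).
batches = [
    ("day1", "low"),
    ("day1", "low"),
    ("day1", "high"),
    ("day2", "high"),
    ("day2", "high"),
]
statuses = ["healthy", "patient", "patient", "healthy", "patient"]

print(single_status_batches(batches, statuses))  # [('day1', 'high')]
```

Running the same check on each factor separately and then on their combination shows whether a design that looks balanced factor by factor still has empty cells once the factors are crossed.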