I am looking for references about preventing batch effects (before carrying out the experiments). I'm interested in the discussion of sample distribution and replicates, but I can't seem to find good references (there are many to pick from on correcting batch effects after the experiments are run, but not on preventing them beforehand).
I tried searching for "batch effect prevention", "confounding effect design", "batch effect design" and similar terms, but the results are not very relevant to my interest. The best article I could find so far is this one: Replication. Could you help me find relevant literature?
You could 1) look at batch-effect correction algorithm papers and see what they cite in their introductions/discussions, or 2) look at which papers have cited them, through Google Scholar. I often find this very useful for following certain research lines. For example, among the papers that cite the SVA paper or this other batch-correction review, you may find other related review papers.
It may be kind of tough but it sometimes leads you to places that keywords don't.
You may be looking for the wrong term(s). As you allude to, preventing batch effects is primarily a matter of experimental design. Try searching for terms like experimental/study design and/or confounding factors. Also, given the diversity of experiments, I don't think there's going to be a generic solution. In many cases, the source of the effect is either unknown a priori or unavoidable. In my opinion, you should think carefully about your specific situation and try to identify possible sources of unwanted technical variability. Papers may give you ideas but may not cover your specific case.
Yes, I might be using the wrong keywords; that's why I posted them. Searching for experimental design hasn't helped so far because those papers tend to be about power or paired analyses. In practice, after you have designed an experiment a certain way, technical or experimental issues might cause some samples to go missing, and by then it is too late to change the experimental design, or you might be halfway through it. For instance, if you designed a paired RNA-seq analysis but half of the samples of one group fail, how do you continue with your design? Could you increase the technical replicates of some samples?
This is an illustration of what I meant by unknown a priori/unavoidable. If you suspect data will be missing/unusable, then you need to ensure that you have enough samples so that you can either throw away the problematic ones, redo the experiment, or have enough data to apply an imputation method.
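A common back-of-envelope way to build in that slack is to inflate the planned sample size by the expected failure rate. A minimal sketch in Python (the 25% dropout rate is an assumed illustration, not a figure from this thread):

```python
import math

def adjusted_sample_size(n_required: int, dropout_rate: float) -> int:
    """Inflate the planned sample size so that, after an expected
    fraction of samples fails QC, enough usable samples remain."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return math.ceil(n_required / (1 - dropout_rate))

# e.g. 20 usable samples needed, expecting ~25% of samples to fail:
print(adjusted_sample_size(20, 0.25))  # -> 27
```

Of course, as discussed below, this only helps when collecting extra samples is feasible at all.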
The best analysis always starts with good data, so when possible I tend to prefer redoing experiments to fiddling with methods and parameters. If you think there could be technical issues, it may be a good idea, if possible, to do a pilot run to identify the problematic areas.
It's hard to be specific because the design depends on the specifics of your situation, e.g. do you have the resources to do a pilot or reprocess failed samples?
On a second read, I think you might be interested in a tool I developed that aims to reduce the problem without too much fiddling with parameters or methods. I created experDesign precisely to deal with these unavoidable problems before running the experiment in batches. The references I'm looking for are for an article about it. Thanks for all the help.
If this is your goal, maybe you'll benefit from using keywords more related to "stratified sampling experimental design" and similar. Also, take a look at this package/publication (Omixer), which is very relevant to your goal (I've never used it; I found it by chance while trying those keywords).
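To make the "stratified sampling" idea concrete: the core trick those tools apply is to spread each biological group evenly across batches instead of randomizing samples freely. A minimal illustrative sketch in Python (not the experDesign or Omixer API, just the underlying idea: shuffle within each stratum, then deal its members round-robin over batches):

```python
import random
from collections import defaultdict

def stratified_batches(samples, group_of, n_batches, seed=0):
    """Assign samples to batches so that each group (stratum) is
    spread as evenly as possible across the batches."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[group_of[s]].append(s)
    batches = [[] for _ in range(n_batches)]
    offset = 0  # rotate the starting batch so batch sizes stay balanced
    for members in strata.values():
        rng.shuffle(members)  # randomize order within the stratum
        for i, s in enumerate(members):
            batches[(offset + i) % n_batches].append(s)
        offset += len(members)
    return batches

# 4 treated + 4 control samples into 2 batches:
samples = ["s%d" % i for i in range(8)]
group_of = {s: ("treated" if int(s[1:]) < 4 else "control") for s in samples}
print(stratified_batches(samples, group_of, 2))
# each batch ends up with 2 treated and 2 control samples
```

This keeps group and batch from being confounded, so a batch effect can still be estimated and corrected later instead of being inseparable from the biology.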
Thanks for the keywords and the link, will look into it.
Thanks for the link. Looks useful indeed.
The example I used is extreme and hasn't happened yet. But in my group it is sometimes impossible to redo an experiment. We collect samples from some patients at specific timepoints under a specific treatment and process them to get RNA. It is not possible to anticipate where/when a sample will fail, and when one does it is usually too late to recollect or reprocess it. Designing a bigger experiment to account for dropout would mean running longer (which there isn't money for), and it is also difficult because collecting enough samples for a study can take more than 5 years. So I don't think there is any option to design better.
You don't even say what kind of experiment you are doing. For DNA sequencing, batch effects likely don't matter much; for RNA-seq, they matter a lot.
It does matter for DNA sequencing as well, and for any other type of data. Basically, any type of experimental data can be biased by batch effects/confounding factors.
I want something general, not specific to a technology such as microarrays, RNA-seq, or scRNA-seq, although I mostly work with RNA-seq. Thanks!