Hi all,
I posted this on Bioconductor but I think it may be more appropriate here.
About the data: I have 5 tissues, over 100 samples, and 2 variables of interest: RFI (High, Low) and Trial (1, 2). The trial variable is basically a surrogate for genotype, as the main difference between trials is the genotype of the animals. All samples were collected and then processed in the lab by the same person. I don't know the sex of each animal (but that can be obtained from the data with a bit of work). I have no other batch information.
My question: I don't want to apply sva to model a hidden batch until I am confident there actually is a hidden batch. The problem is, I need guidance to know what evidence of a hidden batch looks like. I have read that hidden batches should be evident after exploratory data analysis. For clarification, I'm showing plenty of EDA images here to help my own understanding of replies.
Thank you all in advance, Kenneth
Exploratory Data Analysis results
PCA separates intestinal tissues from liver and kidney very well (1 outlier has now been removed), but there is no clear separation between Ileum and Jejunum, even when intestinal tissues are plotted without liver and muscle:
Within individual tissues, PCAs are showing some clustering by variables of interest, but I don't see any extra groups, or groups of samples sitting way off by themselves (which I think would be evidence of a hidden batch effect):
The heatmaps however are where I need a bit of guidance. Duodenal tissue is clustering weakly by trial, but Ileum, Jejunum, and Muscle show strong clusters not attributable to the variables of interest. Can I consider this evidence of a hidden batch in those tissues or could they just be biological signal that is stronger than the variables of interest? Should I use sva on these tissues or not?
Batch effects arise when samples are processed in batches, whether by different people, at different times, on different machines etc. I'd ask the wetlab folks how they processed these samples and in what batches, e.g. did they run 5 samples at once?
With this info you can then redo the analysis you conducted (excellent, btw) but color / group by any batch information. If you see anything that is not biological clustering (say clustering by sex of animal) but by batch then you should investigate further.
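One way to put a number on "clusters by batch" alongside the coloured plots is to ask how much of a principal component's variance a given label explains. The sketch below is only an illustration: it assumes the PC scores have already been computed elsewhere, and the scores, the `run_day` labels and the `sex` labels are made-up values, not data from this thread.

```python
from collections import defaultdict


def variance_explained(scores, labels):
    """Fraction of the variance in `scores` explained by the grouping `labels`.

    This is the one-way ANOVA R^2: between-group sum of squares divided by
    the total sum of squares.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    grand_mean = sum(scores) / len(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    if total_ss == 0:
        return 0.0
    groups = defaultdict(list)
    for score, label in zip(scores, labels):
        groups[label].append(score)
    between_ss = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return between_ss / total_ss


# Hypothetical PC1 scores for 8 samples (illustrative values only).
pc1 = [-2.1, -1.9, -2.0, -2.2, 1.9, 2.1, 2.0, 2.2]
run_day = ["A", "A", "A", "A", "B", "B", "B", "B"]  # hypothetical batch label
sex = ["F", "M", "F", "M", "F", "M", "F", "M"]      # hypothetical sex calls

r2_run_day = variance_explained(pc1, run_day)
r2_sex = variance_explained(pc1, sex)
print(f"PC1 explained by run day: {r2_run_day:.3f}")
print(f"PC1 explained by sex:     {r2_sex:.3f}")
```

In this toy example PC1 is almost entirely explained by the run day and hardly at all by sex, which is the pattern that would warrant a closer look. A high value for a variable of interest (RFI, Trial) is expected biology, not a batch effect.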
Btw, you need to find the sex of the subjects; that is important.
On a related topic, batch effects fall under technical variation, i.e. variation that arises from the technical side of the experiment. This needs to be ruled out as it can result in, say, false positives.
On the other hand, biological variation is something we can be interested in, e.g. the difference in expression between sexual and asexual stages of a parasite. But sometimes biological variation that isn't the focus needs to be accounted for, e.g. when what we're actually interested in is how animals respond to vaccines.
Imo, with so little information, I think you want to account for the sex of the animal before you investigate genotype / tissue differences. This can be done by including sex as a term in your model.
I appreciate the reply @Mark. "If you see anything that is not biological clustering ( say clustering by sex of animal) but by batch then you should investigate further." - very useful, thank you
I am aware of the other points you made so apologies if my question wasn't specific enough.
So just to clarify, there is no more metadata available and I can (and will) find out the sex from the data.
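For what it's worth, a rough sketch of how sex might be called from the counts is below. It assumes a set of Y-linked genes for the species has already been chosen and their counts summed per sample; the sample names, counts and library sizes are hypothetical. It splits the samples at the widest gap in log2 CPM, so the result should be checked by eye, and it is meaningless if all samples turn out to be the same sex.

```python
import math


def call_sex_from_y_genes(y_counts, library_sizes):
    """Label samples 'M' or 'F' from summed Y-linked read counts.

    Counts are converted to log2(CPM + 1) and the samples are split at the
    midpoint of the widest gap between sorted values. Returns the calls, the
    threshold used, and the width of that gap so it can be sanity-checked.
    """
    if len(y_counts) < 2:
        raise ValueError("need at least two samples to find a gap")
    log_cpm = {}
    for sample, count in y_counts.items():
        cpm = count / library_sizes[sample] * 1e6
        log_cpm[sample] = math.log2(cpm + 1)
    ordered = sorted(log_cpm.values())
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    widest, idx = max(gaps)
    threshold = (ordered[idx] + ordered[idx + 1]) / 2
    calls = {s: ("M" if v > threshold else "F") for s, v in log_cpm.items()}
    return calls, threshold, widest


# Hypothetical summed Y-linked counts and library sizes (illustrative only).
y_counts = {"s1": 0, "s2": 3, "s3": 850, "s4": 1200}
library_sizes = {"s1": 20_000_000, "s2": 20_000_000,
                 "s3": 20_000_000, "s4": 20_000_000}

calls, threshold, widest_gap = call_sex_from_y_genes(y_counts, library_sizes)
print(calls)
```

A small widest gap relative to the spread of the values would suggest the split is not trustworthy, for example because the chosen genes are not expressed in that tissue.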
However, even after I find out the sex of each sample, the main question still needs answering... How can EDA tell me if there is still a HIDDEN batch effect not accounted for by the AVAILABLE metadata?
My understanding from reading is this (please, anyone, correct me if I'm wrong):
1) Batch correction must be justified with evidence, otherwise we are basically fitting the data blindly.
2) We get that evidence from the metadata and EDA.
3) Metadata gives us sources of technical and biological variation, but is usually incomplete.
4) EDA then answers 2 questions: a) is batch correction necessary?, and b) if it was necessary, has it been successful?
This is where I'm stuck. I don't know what I'm looking for in the EDA.
Regarding the model: I'll be analysing each tissue separately with DESeq2 (~RFI+Trial+RFI:Trial), then compare the up/down lists for each tissue in some sort of nightmarish Venn diagram lol.
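To make the terms in that formula concrete, here is a small sketch of the design matrix it implies. It assumes treatment coding with "High" and Trial "1" as the reference levels, and the four samples are hypothetical. This is a plain-Python illustration, not DESeq2 itself.

```python
from collections import Counter


def design_row(rfi, trial, rfi_ref="High", trial_ref="1"):
    """One design-matrix row: [intercept, RFI, Trial, RFI:Trial].

    Each factor is coded 0 at its reference level and 1 otherwise; the
    interaction column is the product of the two main-effect columns.
    """
    rfi_term = 0 if rfi == rfi_ref else 1
    trial_term = 0 if trial == trial_ref else 1
    return [1, rfi_term, trial_term, rfi_term * trial_term]


# Hypothetical samples, one per RFI x Trial combination.
samples = [("High", "1"), ("Low", "1"), ("High", "2"), ("Low", "2")]
design = [design_row(rfi, trial) for rfi, trial in samples]

# The interaction can only be estimated if all four cells contain samples.
cell_counts = Counter(samples)

for (rfi, trial), row in zip(samples, design):
    print(rfi, trial, row)
```

The interaction column is 1 only for Low RFI animals in Trial 2, so that coefficient measures how much the RFI effect differs between the two trials. Adding sex to the model would add one more column coded the same way.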
Sorry for being so longwinded, and thanks again for the reply
Kenneth