I have pooled 123 samples from two GEO antibody microarray studies that used the same platform. I downloaded the raw .gpr files and opened each one in Excel to get the scan date of each sample (presumably the DateTime variable), which I recorded in a separate Excel sheet.
My understanding is that if two samples have different scan dates, they are from different batches. If so, then the 123 samples break down into the following batches:
Batch 1: 4 samples
Batch 2: 2 samples
Batch 3: 2 samples
Batch 4: 4 samples
Batch 5: 4 samples
Batch 6: 8 samples
Batch 7: 8 samples
Batch 8: 8 samples
Batch 9: 8 samples
Batch 10: 8 samples
Batch 11: 12 samples
Batch 12: 7 samples
Batch 13: 12 samples
Batch 14: 3 samples
Batch 15: 6 samples
Batch 16: 2 samples
Batch 17: 4 samples
Batch 18: 1 sample
Batch 19: 3 samples
Batch 20: 3 samples
Batch 21: 4 samples
Batch 22: 2 samples
Batch 23: 4 samples
Batch 24: 2 samples
Batch 25: 2 samples
Should I keep the above delineation of batches or should I combine small batches? Any advice?
Also, Batches 1-14 were from 11/9/2010 - 12/17/2010, while batches 15-25 were from 3/23/2012 - 4/27/2012.
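For reference, the "one batch per scan date" rule described above can be written down as a short script, which also makes the very small batches easy to spot. This is only a sketch: the sample IDs and dates below are made up for illustration, and grouping by calendar year is just one possible way to separate the 2010 and 2012 runs.

```python
from collections import Counter
from datetime import date

# Hypothetical sample IDs and scan dates; the real values would come
# from the DateTime field of each .gpr file.
scan_dates = {
    "S01": date(2010, 11, 9),
    "S02": date(2010, 11, 9),
    "S03": date(2010, 11, 10),
    "S04": date(2012, 3, 23),
    "S05": date(2012, 3, 23),
    "S06": date(2012, 4, 27),
}

def batches_by_date(scan_dates):
    """Number the distinct scan dates chronologically and map each sample to its batch."""
    ordered = sorted(set(scan_dates.values()))
    batch_of_date = {d: i + 1 for i, d in enumerate(ordered)}
    return {sample: batch_of_date[d] for sample, d in scan_dates.items()}

batch = batches_by_date(scan_dates)

# Batch sizes, and the batches that contain a single sample.
sizes = Counter(batch.values())
singletons = sorted(b for b, n in sizes.items() if n == 1)

# A coarser grouping: one label per scanning period (here simply the year).
period = {sample: d.year for sample, d in scan_dates.items()}

print(dict(sizes))
print(singletons)
```

With the real dates, `sizes` should reproduce the table above, and `period` gives a two-level alternative (2010 vs. 2012) to compare against.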
Thanks,
James
Make a PCA plot and/or cluster the samples and see how they group. That's usually an effective way to gauge batch effects. Also, have a look at ComBat() in the sva Bioconductor package.
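To illustrate the PCA idea, here is a dependency-free sketch that computes first-principal-component scores by power iteration and lists them per batch label. The intensity values and the batch labels "A" and "B" are invented for the example. In practice you would more likely use an established PCA routine on log-transformed intensities and plot the first two components, colored by candidate batch.

```python
import math

def first_pc_scores(rows, n_iter=200):
    """PC1 scores for a samples-by-features table, via power iteration on X^T X."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Uniform start vector; adequate for this toy data, a random start is more general.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]  # X v
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # constant data, no direction of variation
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]

# Invented data: three features per sample, with batch "B" shifted on two of them.
rows = [
    [1.0, 2.0, 3.0],
    [1.2, 1.9, 3.1],
    [0.9, 2.1, 2.9],
    [3.0, 4.1, 3.0],
    [3.1, 3.9, 3.1],
    [2.9, 4.0, 2.9],
]
labels = ["A", "A", "A", "B", "B", "B"]

scores = first_pc_scores(rows)
by_batch = {}
for label, score in zip(labels, scores):
    by_batch.setdefault(label, []).append(round(score, 2))

print(by_batch)
```

If the scores for two candidate batches overlap completely, that is a hint the split between them may not matter much. If they separate cleanly, as in this toy data, the batch variable is worth modelling.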
Also, never, ever use Excel for anything bioinformatics.
Thanks for the response, but that didn't really answer my question, although I probably wasn't as clear as I could have been. Basically:
Am I correct in organizing the 123 samples into 25 batches in the way that I did? Since posting this question, I've realized each sample's .gpr file has, along with DateTime, a GalFile variable with values such as: GalFile = C:\Users\Genepix\Desktop\ProtoArray\HA20251.gal. The item of interest here is the HA20251, which I recalled seeing somewhere in the provided .xls workbook of processed data as a "lot number". Should I consider a batch to be "samples with the same lot number" (i.e. 1 batch would be all the samples with "HA20251" in their .gal file address), or should I keep my batch definition to "all samples with the same day in their DateTime variable"?
Essentially, I'm hoping to extract from the provided data files an explicit batch identification for each sample to be used in a Target file in order to upload the data into the PAA R package to then apply batch adjustment. If I can't get explicit batch identifiers (which I think I can), then I'll need algorithms to "discover" batch effects.
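As a sketch of how the two candidate batch identifiers could be pulled out without opening each file by hand, the snippet below reads `key=value` header records and extracts the scan date and the lot code from the .gal path. Several things here are assumptions: that the header records look like `"DateTime=..."` and `"GalFile=..."`, that the date is the first whitespace-separated token of the timestamp, and the example timestamp itself. The output column names are placeholders and not necessarily what PAA expects in a Target file.

```python
from pathlib import PureWindowsPath

def read_gpr_header(lines, wanted=("DateTime", "GalFile")):
    """Collect selected key=value header records from the top of a .gpr file."""
    found = {}
    for line in lines:
        text = line.strip().strip('"')  # header records are assumed to be quoted
        if "=" not in text:
            continue
        key, value = text.split("=", 1)
        key = key.strip()
        if key in wanted:
            found[key] = value.strip()
        if len(found) == len(wanted):
            break  # stop before reaching the data rows
    return found

# Made-up header lines in the assumed layout; a real file would be read with
# open(path, encoding=...) and passed in line by line.
example_lines = [
    "ATF\t1.0",
    "29\t43",
    '"Type=GenePix Results 3"',
    '"DateTime=2010/11/09 14:05:12"',
    r'"GalFile=C:\Users\Genepix\Desktop\ProtoArray\HA20251.gal"',
]

header = read_gpr_header(example_lines)
scan_date = header["DateTime"].split()[0]      # date part only (assumed format)
lot = PureWindowsPath(header["GalFile"]).stem  # file name without folder or .gal

# One row of a hypothetical per-sample annotation table.
target_row = {"ArrayID": "sample_01", "ScanDate": scan_date, "Lot": lot}
print(target_row)
```

Running this over all 123 files would give both the date and the lot for every sample, so the two batch definitions can be cross-tabulated before choosing one.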
Assuming I was correct in organizing the 123 samples into 25 batches the way that I did, is it problematic to have batches of size 1 and 2? Is there a motivation for combining small batches with a nearby neighbor? For example, suppose 1 sample was scanned on Monday, and 7 samples were scanned on Tuesday, the day after. Would it make more sense to consider them as Batch1 = 1 sample, Batch2 = 7 samples, or to have all 8 samples in one batch?
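To make the two options in the Monday/Tuesday example concrete, here is a small sketch that compares batch sizes under "one batch per day" against "merge days that are at most one day apart". The dates are invented, the function name and the one-day threshold are arbitrary choices for illustration, and this is not a recommendation for either option.

```python
from collections import Counter
from datetime import date

def merge_adjacent_days(dates, max_gap_days=1):
    """Group distinct scan dates so that consecutive dates within max_gap_days share a batch.

    Merging chains: a long run of consecutive scan days ends up as one batch.
    """
    groups = []
    for d in sorted(set(dates)):
        if groups and (d - groups[-1][-1]).days <= max_gap_days:
            groups[-1].append(d)
        else:
            groups.append([d])
    return groups

# Invented example: 1 sample on a Monday, 7 on the Tuesday, 4 about two weeks later.
dates = [date(2010, 11, 15)] + [date(2010, 11, 16)] * 7 + [date(2010, 12, 1)] * 4

per_day = Counter(dates)
sizes_per_day = [per_day[d] for d in sorted(per_day)]  # option 1: one batch per day

groups = merge_adjacent_days(dates)
sizes_merged = [sum(per_day[d] for d in g) for g in groups]  # option 2: merged neighbors

print(sizes_per_day)
print(sizes_merged)
```

Whether the merged version is defensible depends on whether the merged days actually behave alike, which is the kind of thing a PCA or clustering check can show.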
I answered the question you should have asked, rather than the one you did ask :)
Good to know. That helped a lot, thank you!