Question

Batch Correction and Batch Size

0

10.0 years ago

tasjfasfankihj ▴ 10

I have pooled 123 samples from two GEO antibody microarray studies that used the same platform. I downloaded the raw .gpr files and opened each one in Excel to get the scan date of each sample (presumably represented by the variable DateTime), which I recorded in another Excel sheet.

My understanding is that if two samples have different scan dates, they are from different batches. If so, then the 123 samples break down into the following batches:

Batch 1: 4 samples
Batch 2: 2 samples
Batch 3: 2 samples
Batch 4: 4 samples
Batch 5: 4 samples
Batch 6: 8 samples
Batch 7: 8 samples
Batch 8: 8 samples
Batch 9: 8 samples
Batch 10: 8 samples
Batch 11: 12 samples
Batch 12: 7 samples
Batch 13: 12 samples
Batch 14: 3 samples
Batch 15: 6 samples
Batch 16: 2 samples
Batch 17: 4 samples
Batch 18: 1 sample
Batch 19: 3 samples
Batch 20: 3 samples
Batch 21: 4 samples
Batch 22: 2 samples
Batch 23: 4 samples
Batch 24: 2 samples
Batch 25: 2 samples

Should I keep the above delineation of batches or should I combine small batches? Any advice?

Also, Batches 1-14 were from 11/9/2010 - 12/17/2010, while batches 15-25 were from 3/23/2012 - 4/27/2012.

Thanks,
James

Batch-Adjustment Microarray Batch-Correction • 3.2k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by tasjfasfankihj ▴ 10

1

Make a PCA plot and/or cluster the samples and see how they group. That's usually an effective way to gauge batch effects. Also, have a look at ComBat() in the sva Bioconductor package.
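To make the PCA suggestion concrete, here is a minimal pure-Python sketch that projects samples onto their first principal component and lists the scores next to the batch labels. The toy intensities, the sample names and the helper `first_pc_scores` are all made up for illustration. In practice you would run a proper PCA on the real, normalized array data, most likely in R.

```python
import math
import random


def first_pc_scores(rows, iterations=200):
    """Score each sample on the first principal component.

    rows: list of samples, each a list of feature values of equal length.
    Uses power iteration on the feature covariance matrix, which is
    adequate for a toy example but not for real array data.
    """
    n, p = len(rows), len(rows[0])
    # Center every feature (column) at zero.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Feature-by-feature sample covariance matrix.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration converges to the leading eigenvector of cov.
    rng = random.Random(0)
    v = [rng.random() for _ in range(p)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project the centered samples onto that eigenvector.
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]


# Hypothetical toy data: sample -> (batch label, feature intensities).
samples = {
    "s1": ("A", [1.0, 2.0, 1.5, 0.9]),
    "s2": ("A", [1.1, 2.1, 1.4, 1.0]),
    "s3": ("A", [0.9, 1.9, 1.6, 1.1]),
    "s4": ("B", [3.0, 4.1, 3.5, 2.9]),
    "s5": ("B", [3.1, 3.9, 3.6, 3.0]),
    "s6": ("B", [2.9, 4.0, 3.4, 3.1]),
}

labels = [batch for batch, _ in samples.values()]
scores = first_pc_scores([values for _, values in samples.values()])
for name, label, score in zip(samples, labels, scores):
    print(name, label, round(score, 3))
```

If the samples separate along the first component by batch label rather than by biology, as they do in this toy data by construction, that points to a batch effect worth adjusting for.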

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Devon Ryan 104k

2

Also, never, ever use Excel for anything bioinformatics.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by 5heikki 11k

0

Thanks for the response, but that didn't really answer my question, although I probably wasn't as clear as I could have been. Basically:

Am I correct in organizing the 123 samples into 25 batches in the way that I did? Since posting this question, I've realized each sample's .gpr file has, along with DateTime, a GalFile variable with values such as: GalFile = C:\Users\Genepix\Desktop\ProtoArray\HA20251.gal. The item of interest here is the HA20251, which I recall seeing somewhere in the provided .xls workbook of processed data as a "lot number". Should I consider a batch to be "samples with the same lot number" (i.e. one batch would be all the samples with "HA20251" in their .gal file path), or should I keep my batch definition as "all samples with the same day in their DateTime variable"?

Essentially, I'm hoping to extract from the provided data files an explicit batch identification for each sample to be used in a Target file in order to load the data into the PAA R package to then apply batch adjustment. If I can't get explicit batch identifiers (which I think I can), then I'll need algorithms to "discover" batch effects.

Assuming I was correct in organizing the 123 samples into 25 batches the way that I did, is it problematic to have batches of size 1 and 2? Is there a motivation for combining small batches with a nearby neighbor? For example, suppose 1 sample was scanned on Monday, and 7 samples were scanned on Tuesday, the day after. Would it make more sense to consider them as Batch1 = 1 sample, Batch2 = 7 samples, or to have all 8 samples in one batch?
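Rather than opening each .gpr file in Excel, both candidate batch identifiers (scan day from DateTime, lot from the GalFile name) could be pulled out with a short script. This is only a sketch under assumptions: it assumes header records look like `"Key=Value"` near the top of the file, that the date is the first whitespace-separated token of DateTime, and that the lot is the .gal file name without its extension. The example header is a hand-written stand-in with an invented DateTime value, and the function names are hypothetical.

```python
from itertools import islice
from pathlib import PureWindowsPath


def read_header_fields(lines, wanted=("DateTime", "GalFile"), max_lines=60):
    """Collect selected Key=Value records from the top of a .gpr file.

    max_lines is an assumed upper bound on the header length.
    """
    fields = {}
    for line in islice(lines, max_lines):
        text = line.strip().strip('"')          # drop surrounding quotes
        key, sep, value = text.partition("=")   # split on the first '=' only
        if sep and key.strip() in wanted:
            fields[key.strip()] = value.strip()
    return fields


def batch_labels(fields):
    """Return (scan_day, lot) as two alternative batch identifiers."""
    scan_day = fields["DateTime"].split()[0]      # assumes the date comes first
    lot = PureWindowsPath(fields["GalFile"]).stem  # e.g. HA20251
    return scan_day, lot


# Hand-written stand-in for a header; the DateTime value is invented.
example_header = [
    'ATF\t1.0',
    '"Type=GenePix Results 3"',
    '"DateTime=2010/11/09 14:22:31"',
    r'"GalFile=C:\Users\Genepix\Desktop\ProtoArray\HA20251.gal"',
]

print(batch_labels(read_header_fields(example_header)))
```

For real files you would pass an open file handle instead of `example_header` and write the two labels per sample into the Target file, then compare how the samples group under each definition.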

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by tasjfasfankihj ▴ 10

0

I answered the question you should have asked, rather than the one you did ask :)

The way you're doing it currently seems correct. Perhaps using HA20251 etc. as the batch identifier instead would work better, but the only way to know would be to contact the people who produced the data (or cluster things as I suggested earlier).

Batches of size 1 end up becoming useless. A batch of size 2 may be useful, depending on whether the batch members are all from the same treatment group or not (it's better if they're not).
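A quick way to check both points before running any batch adjustment is to tabulate, per batch, how many samples it has and whether it contains more than one treatment group. The sketch below uses hypothetical sample IDs, batch names and group labels; `summarize_batches` is an illustrative helper, not part of any package.

```python
from collections import defaultdict


def summarize_batches(samples):
    """samples: iterable of (sample_id, batch, group) tuples.

    Returns, per batch, its size, whether it is a singleton, and whether
    all of its members belong to a single treatment group.
    """
    groups_by_batch = defaultdict(list)
    for _sample_id, batch, group in samples:
        groups_by_batch[batch].append(group)
    return {
        batch: {
            "n": len(groups),
            "singleton": len(groups) == 1,
            "single_group": len(set(groups)) == 1,
        }
        for batch, groups in groups_by_batch.items()
    }


# Hypothetical annotation table: (sample_id, batch, treatment group).
annotation = [
    ("s1", "batch16", "case"),
    ("s2", "batch16", "control"),
    ("s3", "batch18", "case"),
    ("s4", "batch24", "case"),
    ("s5", "batch24", "case"),
]

summary = summarize_batches(annotation)
for batch, info in summary.items():
    print(batch, info)
```

Here batch18 would be flagged as a singleton, and batch24 as a batch of size 2 whose members share one group, so its batch effect cannot be separated from the group effect.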