Hello everyone,
I am new to microarray analysis. I am attempting basic QC on gene expression data obtained from Affymetrix HG-U133A chips before exploring different methods of normalization.
I have been working through various tutorials online and looking at the output of the qc function in Bioconductor's simpleaffy package and the fitPLM function in affyPLM.
My data set contains some poor-quality image files. Raw probe intensity density plots reveal a number of outliers, and a number of CEL files fail QC metrics such as the 3'/5' ratios for beta-actin and GAPDH.
I wish to exclude such files from downstream analysis.
However, from Gentleman, R. et al. (2005), Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Springer, and Gentleman, R. (2007), Some Quality methods for affymetrix microarrays (http://www.bioconductor.org/help/course-materials/2007/biocadv/Labs/AffyQuality/AffyQuality.pdf), I see that RNA degradation plots and NUSE values are not comparable across data sets. I assume this also applies to other metrics.
How loosely can "data set" be applied?
My "data set" consists of 150 CEL files from independent samples in a single study, processed and run on numerous dates over the course of several years.
Is it valid to perform these QC procedures treating my CEL files as a single data set? If not, are there any valid methods of performing QC in such situations?
Thanks in advance for any assistance.
Hi Maciej,
Thank you for your reply.
This is as I had feared. The arrays were processed over several years and forty runs.
Since NUSE plots and density histograms can't validly be applied to my data as a whole, are there QC strategies that can be applied to such data in order to remove outliers?
I was intending to look for batch effects downstream of QC and normalization, but I am not sure whether analysis of these CEL files is feasible, or whether it should even be attempted.
I am doubtful, but hopeful. I would be grateful for advice, even if that advice is that analysis is not achievable.
If the data look good within runs (which probably correspond to replications), you can try to specify batch as a blocking factor in your analysis.
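To make the blocking idea concrete, here is a minimal sketch in plain Python, not a recommended pipeline. It estimates a two-group difference for a single probe set while adjusting for batch, by centering both the expression values and the group indicator within each batch. That is numerically the same as the group coefficient in the additive model y ~ group + batch. The function names and the toy numbers are made up for illustration. In practice you would more likely add batch to the design matrix of a linear-model package such as limma.

```python
from collections import defaultdict


def naive_group_effect(values, groups):
    """Difference of group means, ignoring batch (for comparison only)."""
    treated = [v for v, g in zip(values, groups) if g == 1]
    control = [v for v, g in zip(values, groups) if g == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def blocked_group_effect(values, groups, batches):
    """Group-1 minus group-0 difference with batch as a blocking factor.

    Centers the response and the 0/1 group indicator within each batch,
    then regresses one on the other. This equals the `group` coefficient
    of an ordinary least-squares fit of y ~ group + batch.
    """
    by_batch = defaultdict(list)
    for i, b in enumerate(batches):
        by_batch[b].append(i)

    num = 0.0
    den = 0.0
    for idx in by_batch.values():
        y_mean = sum(values[i] for i in idx) / len(idx)
        g_mean = sum(groups[i] for i in idx) / len(idx)
        for i in idx:
            g_centered = groups[i] - g_mean
            num += g_centered * (values[i] - y_mean)
            den += g_centered * g_centered

    if den == 0.0:
        # Every batch contains only one group: the effect cannot be
        # separated from the batch effect.
        raise ValueError("group is completely confounded with batch")
    return num / den


# Hypothetical log2 expression values for one probe set.
# Batch B sits about 3 units above batch A, and the design is unbalanced.
values = [5.0, 5.0, 5.0, 6.0, 8.0, 9.0, 9.0, 9.0]
groups = [0, 0, 0, 1, 0, 1, 1, 1]          # 0 = control, 1 = treated
batches = ["A", "A", "A", "A", "B", "B", "B", "B"]

print("ignoring batch:", naive_group_effect(values, groups))        # 2.5
print("blocking on batch:", blocked_group_effect(values, groups, batches))  # 1.0
```

In the toy data the true within-batch difference is 1.0, but ignoring batch gives 2.5 because most treated samples happen to be in the higher batch. Note the error case: if each run contains only one group, blocking cannot rescue the comparison, which is worth checking before going further with data spread over many runs.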