Hi,
I'm new to microarray data formats and I'm wondering exactly what state this data is in (normalised? CEL (txt version)?). It's publicly available data from the Osteosarcoma study from NCI TARGET. It's Affymetrix SNP 6.0 (human). TIA
Example here:
SampleName Chromosome Start End Num Probes Segment Mean
TARGET-40-0A4HLD-01A-01D chr1 61735 709822 36 1.4796
TARGET-40-0A4HLD-01A-01D chr1 709822 4114370 1155 2.72479
TARGET-40-0A4HLD-01A-01D chr1 4114370 4408030 229 3.78919
TARGET-40-0A4HLD-01A-01D chr1 4408030 4417924 10 2.2115
TARGET-40-0A4HLD-01A-01D chr1 4417924 6229890 1540 3.62128
TARGET-40-0A4HLD-01A-01D chr1 6229890 12860827 3674 3.09346
So this is un-normalised data, correct? And if I wanted normalised data (say L3 RMA-normalised) I would have to obtain the CEL/CDF files? Am I correct about this for normalisation? TIA
Given that it reports the 'segment mean' and the number of probes per segment, I'm assuming that the raw data has undergone circular binary segmentation (CBS), a popular algorithm for determining copy number; therefore, it would be normalised and effectively ready for use. Can you give an exact source (link / URL) from where you got it?
Sure, it's from the Osteosarcoma public data set at TARGET: https://ocg.cancer.gov/programs/target/data-matrix
Thanks for the link. I can see (here) that they processed the copy number data using Partek Genomics Suite, but they do not provide much extra information. It is safe to assume that it is normalised, though. Basically, if you plot that data in a karyoplot, you'll see gains and losses across the genome. One thing that you could do with it is overlap the segments with known genes, too.
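To make the "gains and losses" and gene-overlap ideas a bit more concrete, here is a minimal Python sketch that parses rows in the format posted above and labels each segment. It assumes the Segment Mean column is on an absolute copy-number scale with 2 as the diploid baseline (the posted values suggest this, but I have not confirmed it against the TARGET documentation). The 1.5 and 2.5 cut-offs, the function names, and the query interval are all hypothetical placeholders, not anything Partek or TARGET prescribes.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    sample: str
    chrom: str
    start: int
    end: int
    num_probes: int
    segment_mean: float


def parse_segments(text: str) -> List[Segment]:
    """Parse whitespace-separated segment rows, skipping the header line."""
    segments = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Data rows have exactly 6 fields with a numeric Start column;
        # the header does not, so it is skipped here.
        if len(fields) != 6 or not fields[2].isdigit():
            continue
        sample, chrom, start, end, n_probes, mean = fields
        segments.append(
            Segment(sample, chrom, int(start), int(end), int(n_probes), float(mean))
        )
    return segments


def classify(seg: Segment, loss_below: float = 1.5, gain_above: float = 2.5) -> str:
    """Label a segment; thresholds are assumptions on a copy-number scale."""
    if seg.segment_mean < loss_below:
        return "loss"
    if seg.segment_mean > gain_above:
        return "gain"
    return "neutral"


def overlapping_segments(
    segments: List[Segment], chrom: str, gene_start: int, gene_end: int
) -> List[Segment]:
    """Return segments on `chrom` that overlap the interval [gene_start, gene_end)."""
    return [
        s
        for s in segments
        if s.chrom == chrom and s.start < gene_end and gene_start < s.end
    ]


EXAMPLE = """SampleName Chromosome Start End Num Probes Segment Mean
TARGET-40-0A4HLD-01A-01D chr1 61735 709822 36 1.4796
TARGET-40-0A4HLD-01A-01D chr1 709822 4114370 1155 2.72479
TARGET-40-0A4HLD-01A-01D chr1 4408030 4417924 10 2.2115
"""

segments = parse_segments(EXAMPLE)
for seg in segments:
    print(seg.chrom, seg.start, seg.end, classify(seg))
```

For a real gene list you would pass each gene's chromosome, start, and end (from an annotation matching the genome build of the segments) to `overlapping_segments` and report the labels of the segments returned.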
Okay, thanks - this is very helpful. What we're ultimately trying to do is to use a programme called "InFlo" that makes use of datasets (RNA-Seq; WGS; WXS; SNP; Methylation) to build network maps. So, their input example file (https://github.com/VaradanLab/InFlo) is apparently L3 RMA Normalised (http://felixfan.github.io/RMA-Normalization-Microarray/) and looks like this:
"Gene_Name" "TCGA-24-1545-01" "TCGA-13-2060-01" "TCGA-24-1550-01" "A2BP1" 0.201615042245111 0.135470000274077 0.0185439399735216
So, my real question is, can I get to the RMA Normalised data from what NCI TARGET makes available, or do I need the CEL data and normalise it myself?
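For reference, this is roughly how I would read a matrix of that shape in Python. It is only a sketch: I am assuming one header line followed by one row per gene, with whitespace-separated fields and optional double quotes, since the actual delimiter of the InFlo example file is not visible from the excerpt. The function name is my own.

```python
import shlex
from typing import Dict


def parse_expression_matrix(text: str) -> Dict[str, Dict[str, float]]:
    """Parse a gene-by-sample matrix into {gene: {sample_id: value}}."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    # shlex.split strips the double quotes and splits on any whitespace.
    header = shlex.split(lines[0])
    sample_ids = header[1:]  # first column is the gene-name label
    matrix: Dict[str, Dict[str, float]] = {}
    for line in lines[1:]:
        fields = shlex.split(line)
        gene, values = fields[0], fields[1:]
        if len(values) != len(sample_ids):
            raise ValueError(f"row for {gene} has {len(values)} values, "
                             f"expected {len(sample_ids)}")
        matrix[gene] = {s: float(v) for s, v in zip(sample_ids, values)}
    return matrix


EXAMPLE_MATRIX = '''"Gene_Name" "TCGA-24-1545-01" "TCGA-13-2060-01" "TCGA-24-1550-01"
"A2BP1" 0.201615042245111 0.135470000274077 0.0185439399735216
'''

expression = parse_expression_matrix(EXAMPLE_MATRIX)
print(expression["A2BP1"]["TCGA-24-1545-01"])
```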
So, you need expression data? Then you should be looking for the Affymetrix expression arrays, no? If the CEL files are there, then I would just process them myself via RMA, but not sure of your experience in that area?
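In case it helps to see what is involved: RMA consists of background correction, quantile normalisation, and median-polish summarisation of probes into probe sets. The sketch below illustrates only the quantile normalisation step, in plain Python on a made-up toy matrix (rows are probes, columns are arrays). The function name is my own, ties are broken by row order rather than averaged, and for real data you would use an established implementation such as the one in Bioconductor.

```python
from typing import List


def quantile_normalize(matrix: List[List[float]]) -> List[List[float]]:
    """Force every column (array) to share the same empirical distribution."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Mean of the k-th smallest value across all columns, for each rank k.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(sc[k] for sc in sorted_cols) / n_cols for k in range(n_rows)
    ]

    # Replace each value by the mean associated with its rank in its column.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


# Toy intensities: 4 probes x 3 arrays (illustrative values only).
toy = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalised = quantile_normalize(toy)
for row in normalised:
    print(row)
```

After this step each column contains the same set of values, only in a different order, which is what makes intensities comparable across arrays.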
Okay, I've found the CEL files and the CHP files, and studied how to do this via Bioconductor in R. I think that the RMA has been done and might already be in the CHP file(s). However, in the event that I need to normalise myself, I would have to choose an appropriate database file for the normalisation. The documentation states that the library files needed are "GenomeWideSNP_6", which I found here (http://bioconductor.org/packages/3.9/data/annotation/). Would this be the correct library file(s)?
Well, the CEL files are the raw data files - they are produced by the camera device that scans the fluorescent intensities on the chips.
The CHP files are, if my memory serves me correctly, the genotype files produced by the Affymetrix Genotyping Console, so RMA will not have been used by these. They are most likely literally the genotype calls for the SNP probes, likely called by the Birdseed or Birdseed 2 algorithm.
You will probably have to start from the CEL file stage, and follow the guidance on the Aroma Affymetrix website for the SNP 6.0.