In general, CDFs do not contain sequence information unless they have been customised to include a SEQUENCE field. A CDF maps probes to probesets and probesets to (X,Y) coordinates on the chip, hence the name (chip description file). CBASE, PBASE and TBASE refer to the nucleotides at positions 12, 13 and 14 in the probe.
To get probe sequences for the U133 Plus 2.0 array, go to the Affymetrix product page for that array. From there, you can download either a FASTA file or a tabular file. You'll need to create an account and/or log in first.
Even if your CDF is customised, its probeset IDs should match those in the original product file. If you want to get probeset IDs for a particular gene, you can use BioMart, either via the web or using the Bioconductor biomaRt package. Here is some sample R code to find the probesets for gene HOXB13:
library(biomaRt)
# Connect to the Ensembl human gene dataset
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Look up the U133 Plus 2.0 probeset IDs annotated to HOXB13
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol",
                                "affy_hg_u133_plus_2"),
                 filters = "hgnc_symbol",
                 values = "HOXB13", mart = mart)
results
  ensembl_gene_id hgnc_symbol affy_hg_u133_plus_2
1 ENSG00000159184      HOXB13           230105_at
2 ENSG00000159184      HOXB13           209844_at
From there, you can go back to your FASTA file and pull out the probe sequences for those probesets.
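As a rough sketch of that last step, here is one way to pull out the matching records in base R. It assumes the FASTA headers are colon-delimited with the probeset ID in the third field (something like ">probe:HG-U133_Plus_2:230105_at:..."), so check that against the file you actually downloaded. The helper name extract_probes and the toy headers and sequences below are made up for illustration; with the real file you would pass in the result of readLines() on it instead.

```r
# Toy FASTA standing in for the downloaded probe file.
# Headers and sequences here are invented for illustration only.
fasta_lines <- c(
  ">probe:HG-U133_Plus_2:230105_at:100:200; Interrogation_Position=10; Antisense;",
  "ACGTACGTACGTACGTACGTACGTA",
  ">probe:HG-U133_Plus_2:1007_s_at:300:400; Interrogation_Position=20; Antisense;",
  "TTTTTGGGGGCCCCCAAAAATTTTT",
  ">probe:HG-U133_Plus_2:209844_at:500:600; Interrogation_Position=30; Antisense;",
  "GGGGGAAAAACCCCCTTTTTGGGGG"
)

# Hypothetical helper: return the probes belonging to the given probesets.
# Assumes the probeset ID is the third colon-delimited field of each header.
extract_probes <- function(lines, probeset_ids) {
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)            # which record each line belongs to
  headers <- lines[is_header]

  # Join the sequence lines of each record (handles wrapped sequences)
  groups <- factor(record[!is_header], levels = seq_along(headers))
  seqs <- vapply(split(lines[!is_header], groups),
                 paste, character(1), collapse = "")

  # Third colon-delimited field of the header is taken as the probeset ID
  ids <- vapply(strsplit(headers, ":", fixed = TRUE),
                function(parts) parts[3], character(1))

  keep <- ids %in% probeset_ids
  data.frame(probeset = ids[keep],
             sequence = unname(seqs[keep]),
             stringsAsFactors = FALSE)
}

# The two probesets BioMart returned for HOXB13
probes <- extract_probes(fasta_lines, c("230105_at", "209844_at"))
print(probes)
```

With the real file, each probeset should come back with multiple rows, one per probe. If you already use Bioconductor, the Biostrings package can read the FASTA for you and you can filter on the record names instead.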
The problem with custom CDFs is that their contents vary, because they're customised. Can you post a link to the custom CDF download location, so we can look at it?