Question

Getting Probeset Sequence Information From A Custom Cdf?

15.0 years ago

Sam ▴ 90

Hello, I downloaded a custom CDF file for Affymetrix U133plus2.0 arrays. I am trying to see if I can get the probeset sequence information from this file for a particular gene. Can anyone help me do this? I looked in the file and see some information about cbase, pbase, and tbase. Is that the place to find the information?

[EDIT: text below moved here from answer]

I am actually using one of those remapped CDFs of the U133plus2.0, so I am most interested in the probe-level information: the sequences that actually make up my new probeset. I suppose this is a tougher task than anticipated.

updated 10.9 years ago by Biostar 20 • written 15.0 years ago by Sam ▴ 90

Problem with custom CDFs is that the contents vary, because they're...customised. Can you post a link to the custom CDF download location, so we can look at it?

15.0 years ago by Neilfws 49k

Michael Kuhn · Answer 1 · 2010-08-06

15.0 years ago

David Quigley 11k

The CDF does not contain probe sequences. You can download that information from the Support section of Affymetrix's web site (free registration required): select Annotation Files for the platform you want. Probe sequences are stored in a FASTA file; the one I think you want is

http://www.affymetrix.com/analysis/downloads/data/HG-U133Plus2.probe_fasta.zip

So long as the probeset IDs (e.g. "1007_s_at") can be pulled out of your file, you should be able to match them to this FASTA file. The probes have identifiers of the form:

probe:HG-U133A_2:1007_s_at:416:177; Interrogation_Position=3330; Antisense;
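As a rough illustration of that matching step, here is a minimal base R sketch. It assumes each FASTA record is exactly two lines (header, then sequence) and that the probeset ID is the third colon-separated field of the header, as in the identifier above. `extract_probes()` is a hypothetical helper written for this example, and the toy records use placeholder sequences, not real probe sequences.

```r
# Sketch: pull the probe sequences for one probeset out of probe FASTA lines.
# Assumes two lines per record and colon-separated header fields with the
# probeset ID in third position. extract_probes() is a hypothetical helper.
extract_probes <- function(fasta_lines, probeset_id) {
  is_header <- startsWith(fasta_lines, ">")
  headers   <- fasta_lines[is_header]
  # The sequence is assumed to sit on the single line after each header
  seqs      <- fasta_lines[which(is_header) + 1]
  # Drop the leading ">" and split each header on ":"
  fields    <- strsplit(sub("^>", "", headers), ":", fixed = TRUE)
  probeset  <- vapply(fields, function(f) f[3], character(1))
  keep      <- probeset == probeset_id
  data.frame(header = headers[keep], sequence = seqs[keep],
             stringsAsFactors = FALSE)
}

# Toy input: the second record's coordinates and both sequences are made up
toy <- c(
  ">probe:HG-U133A_2:1007_s_at:416:177; Interrogation_Position=3330; Antisense;",
  "ACGTACGTACGTACGTACGTACGTA",
  ">probe:HG-U133A_2:1053_at:1:2; Interrogation_Position=100; Antisense;",
  "TTTTTTTTTTTTTTTTTTTTTTTTT"
)
hits <- extract_probes(toy, "1007_s_at")
```

For a real file you would pass `readLines()` on the unzipped FASTA instead of `toy`; check first that the file really has one sequence line per record.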

updated 13.1 years ago by Michael Kuhn 5.0k • written 15.0 years ago by David Quigley 11k

I think they want the Plus 2.0 file, at http://www.affymetrix.com/Auth/analysis/downloads/data/HG-U133_Plus_2.probe_fasta.zip .

updated 5.9 years ago by Ram 45k • written 15.0 years ago by Neilfws 49k

Thanks, I noticed that immediately after I posted it. The link is correct in the original response.

15.0 years ago by David Quigley 11k

Ram · Answer 2 · 2010-08-06

In general, CDFs do not contain sequence information, unless they have been customised to contain a SEQUENCE field. A CDF maps probes to probesets and probesets to (X,Y) coordinates on the chip, hence the name (chip descriptor file). CBASE, PBASE and TBASE refer to the nucleotides at positions 12, 13 and 14 in the probe.

To get probe sequences for the U133 Plus 2.0 array, go to the Affymetrix product page for that array. From there, you can download either a FASTA file or a tabular file. You'll need to create an account and/or log in first.

Even if your CDF is customised, there should be matching probeset IDs with the original product file. If you want to get probeset IDs for a particular gene, you can use BioMart, either via the web, or using the Bioconductor biomaRt package. Here is some sample R code, to find the probesets for gene HOXB13:

library(biomaRt)
# Connect to the Ensembl human gene dataset
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Look up the U133 Plus 2.0 probeset IDs annotated to the HGNC symbol HOXB13
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol",
                                "affy_hg_u133_plus_2"),
                 filters = "hgnc_symbol",
                 values = "HOXB13", mart = mart)
results
#   ensembl_gene_id hgnc_symbol affy_hg_u133_plus_2
# 1 ENSG00000159184      HOXB13           230105_at
# 2 ENSG00000159184      HOXB13           209844_at

From there, you can go back to your FASTA file and pull out the probe sequences for those probesets.
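If you work from the tabular file rather than the FASTA file, that last step could be a simple join on the probeset ID. The sketch below is only an illustration: `results` is retyped by hand from the output above so the example runs without a network connection, and `probe_table`, its column names and its sequences are made up to stand in for the real download.

```r
# Sketch: join biomaRt-style results to a probe table by probeset ID.
# `results` copies the getBM() output shown above; `probe_table` is a
# hypothetical stand-in for the tabular probe file (made-up sequences).
results <- data.frame(
  ensembl_gene_id     = c("ENSG00000159184", "ENSG00000159184"),
  hgnc_symbol         = c("HOXB13", "HOXB13"),
  affy_hg_u133_plus_2 = c("230105_at", "209844_at"),
  stringsAsFactors = FALSE
)
probe_table <- data.frame(
  probeset_id = c("230105_at", "230105_at", "209844_at", "1007_s_at"),
  sequence    = c("AAAAAAAAAAAAAAAAAAAAAAAAA", "CCCCCCCCCCCCCCCCCCCCCCCCC",
                  "GGGGGGGGGGGGGGGGGGGGGGGGG", "TTTTTTTTTTTTTTTTTTTTTTTTT"),
  stringsAsFactors = FALSE
)
# Inner join: one output row per probe in a probeset annotated to the gene
gene_probes <- merge(results, probe_table,
                     by.x = "affy_hg_u133_plus_2", by.y = "probeset_id")
```

`merge()` drops probes whose probeset is not in `results`, so `gene_probes` only holds the probes for HOXB13 probesets.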

score 2 · Answer 3 · 2010-08-06

I'm not sure that info is actually contained in the CDF file. My understanding has always been that the CDF file only keeps track of which probes are in each probeset. If your custom CDF is in GEO, then they often have a link to the sequences. If you got it from some other website, then you'll have to root around in there.

Ram · Answer 4 · 2012-01-24

You can also get probe sequence information for stock CDFs this way:

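# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("hgu133plus2probe")
library("hgu133plus2probe")
head(hgu133plus2probe)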

For a custom CDF, you have to download and install the matching probe package. For example: http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/14.1.0/ensg.download/hgu133plus2hsensgprobe_14.1.0.tar.gz which is linked from: http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/14.1.0/ensg.asp

Then install the probe file:

R CMD INSTALL hgu133plus2hsensgprobe_14.1.0.tar.gz

Then run these commands in R:

library(hgu133plus2hsensgprobe)
head(hgu133plus2hsensgprobe)
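Once the probe data frame is loaded, getting the sequences for one probeset is an ordinary row filter. The sketch below uses a small made-up data frame so it runs without the package installed. It assumes the probe object has `sequence` and `Probe.Set.Name` columns, as stock Bioconductor probe packages typically do, and that the custom probeset IDs look like `ENSG..._at`; check `names(hgu133plus2hsensgprobe)` and `head()` on the real object before relying on either.

```r
# Toy stand-in for a probe package data frame; every row is made up.
# Column names are an assumption based on stock Bioconductor probe packages.
toy_probe <- data.frame(
  sequence = c("ACGTACGTACGTACGTACGTACGTA", "TGCATGCATGCATGCATGCATGCAT",
               "GATTACAGATTACAGATTACAGATT"),
  x = c(10L, 11L, 12L),
  y = c(20L, 21L, 22L),
  Probe.Set.Name = c("ENSG00000159184_at", "ENSG00000159184_at",
                     "ENSG00000000003_at"),
  stringsAsFactors = FALSE
)
# Keep only the probes assigned to one probeset
hoxb13_probes <- toy_probe[toy_probe$Probe.Set.Name == "ENSG00000159184_at", ]
hoxb13_probes$sequence
```

With the real package, replace `toy_probe` with `hgu133plus2hsensgprobe` and use a probeset ID taken from your custom CDF.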