How can I get the sequence corresponding to each probe id of affymetrix gene chip?
These are, thankfully, available on the Affymetrix / ThermoFisher website. The documentation for Affymetrix arrays is comprehensive.
For example, for HuGene ST 1.0 / 2.0, the available documentation can be found here: Human Gene ST Arrays - Support Materials
The specific files that you'll want are the Sequence Files. For example:
head HuGene-2_0-st-v1.hg19.probe.fa
>probe:HuGene-2_0-st-v1:1909182-16657436;573:1184; ProbeID=1909182; TranscriptClusterID=16657436; Assembly=build-GRCh37/hg19; Seqname=chr1; Start=12200; Stop=12224; Strand=+; Sense; category=main
CCTAGGTTGTGAGAGAAGTTGATGC
>probe:HuGene-2_0-st-v1:1481686-16657436;257:919; ProbeID=1481686; TranscriptClusterID=16657436; Assembly=build-GRCh37/hg19; Seqname=chr1; Start=12616; Stop=12640; Strand=+; Sense; category=main
GAAGGGCATGCCTGGCATCACCACA
>probe:HuGene-2_0-st-v1:2398055-16657436;1010:1487; ProbeID=2398055; TranscriptClusterID=16657436; Assembly=build-GRCh37/hg19; Seqname=chr1; Start=12644; Stop=12668; Strand=+; Sense; category=main
TCTGCAGCTCTGGAGACCTGATGCT
>probe:HuGene-2_0-st-v1:403478-16657436;477:250; ProbeID=403478; TranscriptClusterID=16657436; Assembly=build-GRCh37/hg19; Seqname=chr1; Start=12668; Stop=12692; Strand=+; Sense; category=main
TGTGATCCAAGTCGGCCGTCGTCTT
>probe:HuGene-2_0-st-v1:1579074-16657436;925:979; ProbeID=1579074; TranscriptClusterID=16657436; Assembly=build-GRCh37/hg19; Seqname=chr1; Start=12669; Stop=12693; Strand=+; Sense; category=main
GTGTGATCCAAGTCGGCCGTCGTCT
You should be able to link these back to the original gene via the TranscriptClusterID. If you have done your annotation via some automated R package, then you may still have this ID.
By the way, if you have already used an R package for annotation, then it may already have a function that provides the sequences - check for that.
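If you would rather work with the FASTA file directly, the headers shown above are regular enough to parse by hand. Below is a minimal Python sketch, assuming every header follows the `key=value; key=value` layout seen in the `head` output; the function names `parse_probe_fasta` and `_make_record` are made up for illustration and do not come from any Affymetrix or Bioconductor tool.

```python
def _make_record(header, sequence):
    """Turn one FASTA header plus its sequence into a dict of fields."""
    fields = {}
    # Header fields are separated by ';'. Only 'key=value' parts are kept,
    # so free-standing tokens such as 'Sense' are ignored.
    for part in header.split(";"):
        part = part.strip()
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key] = value
    fields["sequence"] = sequence
    return fields


def parse_probe_fasta(lines):
    """Yield one dict per probe record from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, "".join(seq))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield _make_record(header, "".join(seq))


# Two records copied from the file excerpt above.
example = [
    ">probe:HuGene-2_0-st-v1:1909182-16657436;573:1184; ProbeID=1909182; "
    "TranscriptClusterID=16657436; Assembly=build-GRCh37/hg19; Seqname=chr1; "
    "Start=12200; Stop=12224; Strand=+; Sense; category=main",
    "CCTAGGTTGTGAGAGAAGTTGATGC",
    ">probe:HuGene-2_0-st-v1:1481686-16657436;257:919; ProbeID=1481686; "
    "TranscriptClusterID=16657436; Assembly=build-GRCh37/hg19; Seqname=chr1; "
    "Start=12616; Stop=12640; Strand=+; Sense; category=main",
    "GAAGGGCATGCCTGGCATCACCACA",
]

# Probe ID -> record, and transcript cluster ID -> list of probe IDs.
by_probe = {rec["ProbeID"]: rec for rec in parse_probe_fasta(example)}
by_cluster = {}
for rec in by_probe.values():
    by_cluster.setdefault(rec["TranscriptClusterID"], []).append(rec["ProbeID"])

print(by_probe["1909182"]["sequence"])
print(by_cluster["16657436"])
```

For the real file, pass an open file handle instead of the `example` list, e.g. `with open("HuGene-2_0-st-v1.hg19.probe.fa") as fh: records = list(parse_probe_fasta(fh))`.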
Kevin
Thank you, Kevin, for your quick response; it really helps. I still have some doubts, though. I am using a gene expression dataset on the [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array, and I need probe-level data for further analysis. Is this publicly available? I want to analyse cell line gene expression data together with the probe-level sequences for that particular cell line. I can get the gene expression from GEO, but where can I get the probe-level input sequences for each experiment? Is there a dataset? I can collect the perfect match sequences from the Affymetrix / ThermoFisher website, but where can I get the mismatch sequences?
To get probe-level expression values, you should download the raw data CEL files from GEO and then re-process them with the oligo package. When background correcting, normalising, and transforming the data with the rma() function, specify target="probeset". This will give you virtual probe-level expression values. See here: [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version]
Other information can be found on the page linked by cpad.
Which chip? Affymetrix markets several chips.
[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array. I need probe-level data for further analysis. Is this publicly available? I want to analyse cell line gene expression data together with the probe-level sequences for that particular cell line. I can get the gene expression from GEO, but where can I get the probe-level input sequences for each experiment? Is there a dataset? I can collect the perfect match sequences from the Affymetrix / ThermoFisher website, but where can I get the mismatch sequences?
Try here: http://www.affymetrix.com/support/technical/byproduct.affx?product=hg-u133-plus and the R/Bioconductor package: https://bioconductor.org/packages/release/data/annotation/html/pd.hg.u133.plus.2.html
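On the mismatch question: on arrays with PM/MM probe pairs, such as HG-U133 Plus 2.0, the mismatch probe is conventionally the perfect match 25-mer with its middle (13th) base swapped for its complement, so the MM sequences can be derived from the PM sequences rather than downloaded separately. A minimal Python sketch of that rule follows; the function name `mismatch_from_perfect_match` is made up for illustration, and the input sequence is an artificial 25-mer, not a real probe.

```python
# Watson-Crick complement for each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_from_perfect_match(pm, position=13):
    """Return the PM sequence with the base at 1-based `position` complemented."""
    if len(pm) < position:
        raise ValueError("sequence is shorter than the mismatch position")
    i = position - 1  # convert to a 0-based index
    return pm[:i] + COMPLEMENT[pm[i].upper()] + pm[i + 1:]


# Artificial 25-mer used only to make the changed base easy to spot.
pm = "A" * 12 + "G" + "T" * 12
mm = mismatch_from_perfect_match(pm)
print(pm)
print(mm)
```

Only the 13th base differs between the two printed lines. It is worth checking the derived sequences against the vendor's probe files for a handful of probes before relying on them.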
Hello, I am using a gene expression dataset on the [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array. I downloaded the CEL files from GEO and normalised them using RMA. Some genes appear multiple times in the data: for example, the gene 'DDR1' has 3 expression values. How should I choose one expression value for DDR1 from these 3? Using biomaRt I can get the chromosome details of these genes, but the result returned by getBM contains even more duplicate entries. Is there any other way to connect the genes with their chromosome details?
Please use ADD COMMENT or ADD REPLY to respond to previous reactions, so that this thread remains logically structured and easy to follow. I have now moved your reaction but, as you can see, it's not optimal. Adding an answer should only be used for providing a solution to the question asked.
Yes, more than 1 probe can map to the same gene. When you normalise the data from the CEL file stage via the rma() function, you can typically 'summarise' expression values over the individual probes or over full transcripts by modifying the target parameter that is passed to rma().
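To make 'summarise' a little more concrete: the summarisation step in RMA is usually described as a median polish over the log2 probe intensities of a probe set, which is more robust to a single misbehaving probe than a plain average. The Python sketch below illustrates that idea on a toy probes x samples matrix. It is not the oligo implementation, it skips the background correction and quantile normalisation that rma() performs first, and the function name `median_polish_summary` and the numbers are made up for illustration.

```python
from statistics import median


def median_polish_summary(log2_pm, max_iter=10, tol=1e-6):
    """Summarise a probes x samples matrix into one value per sample.

    Rows are probes of one probe set, columns are samples. The returned
    list holds overall + column effect for each sample.
    """
    n_rows, n_cols = len(log2_pm), len(log2_pm[0])
    resid = [[float(v) for v in row] for row in log2_pm]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each probe's median into its row effect.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Column sweep: move each sample's median into its column effect.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [resid[i][j] - col_med[j] for j in range(n_cols)]
        col_eff = [c + m for c, m in zip(col_eff, col_med)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return [overall + c for c in col_eff]


# Toy data: 3 probes (rows) x 3 samples (columns), log2 scale.
toy = [
    [8, 9, 10],
    [9, 10, 11],
    [7, 8, 9],
]
summary = median_polish_summary(toy)
print(summary)  # [8.0, 9.0, 10.0]
```

Each probe has its own affinity (the row effect), so the three probes read differently in the same sample; the median polish separates that probe effect from the sample effect and reports one value per sample for the whole probe set.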
Can you write one example? I didn't understand what you mean by 'summarise' expression. Averaging is not a good option for getting the full transcript value of a gene, is it?
At which GEO record are you looking? They usually provide the expression 'summarised' over each transcript.
When we say 'summarised', we refer to the way in which the expression values are calculated. The usual options are: