Question

Extract Expression Profiles Of Specific Region From Geo

4

Entering edit mode

12.7 years ago

Leo ▴ 130

Hi

Is there any tools or script that can extract the expression profiles of specific region from GEO datasets?

such as I want to extract the expression profiles in position from 12222 to 18000 in Eschierichia coli K-12 from GEO datasets.

Thanks.

geo gene • 3.0k views

ADD COMMENT • link updated 12.6 years ago by Neilfws 49k • written 12.7 years ago by Leo ▴ 130

score 9 · Answer 1 · 2012-03-07

Short answer: this is not straightforward.

Longer answer: it depends how well the GEO platform is annotated and in particular, whether you can access the coordinates of the probesets on the platform.

Here's an example using the R/Bioconductor package GEOquery. First, we download data from GEO for the experiment GSE1154, which examined expression changes due to introduction of F-plasmid in 2 strains of E. coli K-12.

library(GEOquery)
gse <- getGEO("GSE1154")

We can extract the raw expression data:

e <- exprs(gse$GSE1154_series_matrix.txt.gz)
head(e[, 1:4])

              GSM19356 GSM19370 GSM19371 GSM19372
aas_b2836_st     629.1    343.6    391.3    465.4
aat_b0885_st     403.5    346.8    341.1    398.4
abc_b0199_st    1110.4    673.4    588.0    625.8
abrB_b0715_st    165.9    141.7    249.4    251.2
accA_b0185_st   1232.2   1184.3   1040.4    931.4
accB_b3255_st   5580.1   5915.2   4312.1   3624.4

And the phenotype data:

pdata <- pData(featureData(gse$GSE1154_series_matrix.txt.gz))

The pdata contains a column named "Target Description" with entries that begin:

/start=2971885 /end=2974036

So in this case, we can extract starts and ends; I like to use the stringr package for this:

library(stringr)
pdata$start <- as.numeric(str_match(pdata$`Target Description`,
                                    "start=([0-9]+)")[,2])
pdata$end   <- as.numeric(str_match(pdata$`Target Description`,
                                    "end=([0-9]+)")[,2])
head(pdata[, c(1, 16:17)])

                         ID   start     end
aas_b2836_st   aas_b2836_st 2971885 2974036
aat_b0885_st   aat_b0885_st  925951  926655
abc_b0199_st   abc_b0199_st  221622  222645
abrB_b0715_st abrB_b0715_st  745946  747037
accA_b0185_st accA_b0185_st  208621  209580
accB_b3255_st accB_b3255_st 3403073 3403543

Now, we should really use something like IRanges to find probesets that overlap specific coordinates. However for now, we'll just look for probesets with either start or end between 2 specified coordinates:

probes <- subset(pdata, (start >= 12222 & start <= 18000 |
                         end >= 12222 & end <= 18000))
dim(probes)
[1] 12 17

In this case, the row names of the expression matrix match those in pdata and are in the same order. So we can just pull out the subset of expression values for probesets within the specified coordinate range:

e.subset <- subset(e, rownames(e) %in% rownames(probes))
dim(e.subset)
[1] 12  8

head(e.subset[, 1:4])

                           GSM19356 GSM19370 GSM19371 GSM19372
dnaJ_b0015_st                2173.9   2106.6   4949.0   3932.2
dnaK_b0014_st                8209.8   5475.1  15698.4  15506.9
gef_b0018_st                   25.6     40.2     46.2    103.6
IG_12_11787_15868_rev_f_st    211.3    392.5   1068.8    466.0
IG_13_14080_14167_fwd_st      970.0   1329.6   1884.1   2417.2
IG_14_15299_15444_fwd_f_st    619.5    862.3   1269.7   1353.9

If the platform annotation does not contain probeset coordinates, you will have to find some other way of linking whatever is annotated (gene names, symbols, IDs) to coordinates, or hope that there is an annotation file available online.