Hi
Is there any tools or script that can extract the expression profiles of specific region from GEO datasets?
such as I want to extract the expression profiles in position from 12222 to 18000 in Eschierichia coli K-12 from GEO datasets.
Thanks.
Hi
Is there any tools or script that can extract the expression profiles of specific region from GEO datasets?
such as I want to extract the expression profiles in position from 12222 to 18000 in Eschierichia coli K-12 from GEO datasets.
Thanks.
Short answer: this is not straightforward.
Longer answer: it depends how well the GEO platform is annotated and in particular, whether you can access the coordinates of the probesets on the platform.
Here's an example using the R/Bioconductor package GEOquery. First, we download data from GEO for the experiment GSE1154, which examined expression changes due to introduction of F-plasmid in 2 strains of E. coli K-12.
library(GEOquery)
gse <- getGEO("GSE1154")
We can extract the raw expression data:
e <- exprs(gse$GSE1154_series_matrix.txt.gz)
head(e[, 1:4])
GSM19356 GSM19370 GSM19371 GSM19372
aas_b2836_st 629.1 343.6 391.3 465.4
aat_b0885_st 403.5 346.8 341.1 398.4
abc_b0199_st 1110.4 673.4 588.0 625.8
abrB_b0715_st 165.9 141.7 249.4 251.2
accA_b0185_st 1232.2 1184.3 1040.4 931.4
accB_b3255_st 5580.1 5915.2 4312.1 3624.4
And the phenotype data:
pdata <- pData(featureData(gse$GSE1154_series_matrix.txt.gz))
The pdata contains a column named "Target Description" with entries that begin:
/start=2971885 /end=2974036
So in this case, we can extract starts and ends; I like to use the stringr package for this:
library(stringr)
pdata$start <- as.numeric(str_match(pdata$`Target Description`,
"start=([0-9]+)")[,2])
pdata$end <- as.numeric(str_match(pdata$`Target Description`,
"end=([0-9]+)")[,2])
head(pdata[, c(1, 16:17)])
ID start end
aas_b2836_st aas_b2836_st 2971885 2974036
aat_b0885_st aat_b0885_st 925951 926655
abc_b0199_st abc_b0199_st 221622 222645
abrB_b0715_st abrB_b0715_st 745946 747037
accA_b0185_st accA_b0185_st 208621 209580
accB_b3255_st accB_b3255_st 3403073 3403543
Now, we should really use something like IRanges to find probesets that overlap specific coordinates. However for now, we'll just look for probesets with either start or end between 2 specified coordinates:
probes <- subset(pdata, (start >= 12222 & start <= 18000 |
end >= 12222 & end <= 18000))
dim(probes)
[1] 12 17
In this case, the row names of the expression matrix match those in pdata and are in the same order. So we can just pull out the subset of expression values for probesets within the specified coordinate range:
e.subset <- subset(e, rownames(e) %in% rownames(probes))
dim(e.subset)
[1] 12 8
head(e.subset[, 1:4])
GSM19356 GSM19370 GSM19371 GSM19372
dnaJ_b0015_st 2173.9 2106.6 4949.0 3932.2
dnaK_b0014_st 8209.8 5475.1 15698.4 15506.9
gef_b0018_st 25.6 40.2 46.2 103.6
IG_12_11787_15868_rev_f_st 211.3 392.5 1068.8 466.0
IG_13_14080_14167_fwd_st 970.0 1329.6 1884.1 2417.2
IG_14_15299_15444_fwd_f_st 619.5 862.3 1269.7 1353.9
If the platform annotation does not contain probeset coordinates, you will have to find some other way of linking whatever is annotated (gene names, symbols, IDs) to coordinates, or hope that there is an annotation file available online.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Very nice reply! Small shorthand: fData(XXX) = pData(featureData(XXX))
great thanks, I think it now can solved my problem.