Entering edit mode
4.2 years ago
bjwiley23
▴
40
Hi,
Normally when I use getGEO for microarray experiment, the sampleNames() defaults to the "GSM**" name but not for GEO="GSE6691". It's defaulting the sampleNames to the data for probe 201088_at. Not sure why this is so. It works perfectly fine for other GEO IDs such as "GSE19147". Why is this? Here is code and output. There is also supposed to be 22283 features but I only get 21667. I get same parsing errors for "GSE19147" but I get correct sampleNames and correct 22283 features.
Input:
my.gse2 <- "GSE6691"
my.geo.gse2 <- getGEO(GEO=my.gse2,
destdir="./geo_downloads",
getGPL=T)
Output:
Parsed with column specification:
cols(
.default = col_double(),
`201088_at` = col_character()
)
See spec(...) for full column specifications.
Using locally cached version of GPL96 found here:
./geo_downloads/GPL96.soft
Warning: 68 parsing failures.
row col expected actual file
22216 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22217 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22218 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22219 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
22220 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
..... ....... .................. ......... ............
See problems(...) for more details.
Input:
my.geo.gse2
sampleNames(my.geo.gse2)
Output:
$GSE6691_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 21667 features, 56 samples
element names: exprs
protocolData: none
phenoData
sampleNames: 5.82592 6.799907 ... 5.8102 (56 total)
varLabels: title geo_accession ... data_row_count (52 total)
varMetadata: labelDescription
featureData
featureNames: 201089_at 201090_x_at ... AFFX-TrpnX-M_at (21667 total)
fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 17252022
Annotation: GPL96
[1] "5.82592" "6.799907" "5.388776" "6.630125" "7.093243" "6.426071" "6.482608" "6.363402"
[9] "5.429877" "6.561914" "7.302787" "7.182047" "6.25908" "8.237549" "6.504661" "6.386279"
[17] "6.709859" "5.855908" "6.686627" "6.401285" "6.066935" "7.919946" "6.703274" "7.679963"
[25] "8.246524" "7.534912" "7.483763" "7.917386" "5.945879" "5.479577" "5.179641" "5.507846"
[33] "5.033364" "7.497475" "8.670058" "6.941275" "7.268032" "8.177226" "7.658442" "6.746167"
[41] "5.127423" "7.357877" "7.106516" "6.029429" "6.865016" "4.837314" "6.368258" "6.591957"
[49] "5.837441" "6.472433" "6.593988" "7.668101" "5.39544" "4.960329" "6.512131" "5.8102"
Well at least I know where the problem is but as to why I am still concerned. That probe 201088_at is at line 616 which is the exact amount of probes missing from the getGEO() ExpressionSet. It is starting at line 616 of the data using the data from that line as the "header" or sample names. Where would I report something like this?
GEOquery documentation here indicates where to report issues: https://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html#reporting-problems-or-bugs
FWIW it seems this might be an unexpected skip behavior in readr library's read_tsv for this particular GSE. Which might be worth reporting at the places recommended in the link above. Presuming you have downloaded and unzipped the series matrix file, using the traditional read.delim in place of read_tsv:
shows the expected result, a 22283x57 matrix.