Question

How to retrieve the genes associated with an RNA PAXgene gene expression dataset from GEO?

5.3 years ago

Davide Chicco ▴ 120

In the past I worked with a gene expression dataset generated on an Affymetrix platform, and I was able to use the getBM() function from the biomaRt Bioconductor package to retrieve the genes associated with its probes.

These are the lines of R code I used:

# Gene list
library(biomaRt)

mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)

# Map the Affymetrix probe IDs (row names of the expression matrix) to Ensembl genes
thisAnnotLookup <- getBM(mart=mart, attributes=c("affy_hugene_1_0_st_v1", "ensembl_gene_id", "gene_biotype", "external_gene_name"), filters="affy_hugene_1_0_st_v1", values=rownames(thisGSetExprss), uniqueRows=TRUE)

And everything worked. Now I am working on another microarray dataset, generated from PAXgene-extracted RNA, and I am trying to understand how to retrieve the genes associated with it. The platform used is RNG-MRC_HU25k_STRASBOURG, which I could not find in BioMart.

What can I do?

Thanks!

-- Davide

EDIT: These are the fields present in my GEO variable in R

> str(gset)
Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
  ..@ experimentData   :Formal class 'MIAME' [package "Biobase"] with 13 slots
  .. .. ..@ name             : chr "Yvan,,Devaux"
  .. .. ..@ lab              : chr ""
  .. .. ..@ contact          : chr "yvan.devaux@lih.lu"
  .. .. ..@ title            : chr "Integrated Network and Microarray Analysis to Identify New Biomarkers in Ischemic Heart Disease"
  .. .. ..@ abstract         : chr "A significant proportion of acute myocardial infarction (MI) patients develop heart failure (HF). Early identif"| __truncated__
  .. .. ..@ url              : chr "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11947"
  .. .. ..@ pubMedIds        : chr "20462429\n20414696\n20300185"
  .. .. ..@ samples          : list()
  .. .. ..@ hybridizations   : list()
  .. .. ..@ normControls     : list()
  .. .. ..@ preprocessing    : list()
  .. .. ..@ other            :List of 23
  .. .. .. ..$ contact_address        : chr "120 route d'Arlon"
  .. .. .. ..$ contact_city           : chr "Luxembourg"
  .. .. .. ..$ contact_country        : chr "Luxembourg"
  .. .. .. ..$ contact_email          : chr "yvan.devaux@lih.lu"
  .. .. .. ..$ contact_institute      : chr "LIH"
  .. .. .. ..$ contact_laboratory     : chr "Cardiovascular Research Unit"
  .. .. .. ..$ contact_name           : chr "Yvan,,Devaux"
  .. .. .. ..$ contact_zip/postal_code: chr "1150"
  .. .. .. ..$ geo_accession          : chr "GSE11947"
  .. .. .. ..$ last_update_date       : chr "Mar 19 2012"
  .. .. .. ..$ overall_design         : chr "The 32 patients of this study were divided in 2 groups corresponding to the extreme quartiles of FE values. The"| __truncated__
  .. .. .. ..$ platform_id            : chr "GPL1947"
  .. .. .. ..$ platform_taxid         : chr "9606"
  .. .. .. ..$ pubmed_id              : chr "20462429\n20414696\n20300185"
  .. .. .. ..$ relation               : chr "BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA105803"
  .. .. .. ..$ sample_id              : chr "GSM302309 GSM302310 GSM302311 GSM302312 GSM302313 GSM302314 GSM302315 GSM302316 GSM302317 GSM302318 GSM302319 G"| __truncated__
  .. .. .. ..$ sample_taxid           : chr "9606"
  .. .. .. ..$ status                 : chr "Public on May 25 2010"
  .. .. .. ..$ submission_date        : chr "Jul 01 2008"
  .. .. .. ..$ summary                : chr "A significant proportion of acute myocardial infarction (MI) patients develop heart failure (HF). Early identif"| __truncated__
  .. .. .. ..$ supplementary_file     : chr "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11947/suppl/GSE11947_RAW.tar"
  .. .. .. ..$ title                  : chr "Integrated Network and Microarray Analysis to Identify New Biomarkers in Ischemic Heart Disease"
  .. .. .. ..$ type                   : chr "Expression profiling by array"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 2
  .. .. .. .. .. ..$ : int [1:3] 1 0 0
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ assayData        :<environment: 0x562675095e10> 
  ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 69 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr [1:69] NA NA NA NA ...
  .. .. ..@ data             :'data.frame': 32 obs. of  69 variables:
  .. .. .. ..$ title                   : Factor w/ 32 levels "BL 708","DA 706",..: 27 26 18 6 11 28 7 2 3 17 ...
  .. .. .. ..$ geo_accession           : chr [1:32] "GSM302309" "GSM302310" "GSM302311" "GSM302312" ...
  .. .. .. ..$ status                  : Factor w/ 1 level "Public on May 25 2010": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ submission_date         : Factor w/ 1 level "Jul 01 2008": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ last_update_date        : Factor w/ 1 level "May 25 2010": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ type                    : Factor w/ 1 level "RNA": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ channel_count           : Factor w/ 1 level "2": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ source_name_ch1         : Factor w/ 12 levels "BL 708","HJ687",..: 11 12 6 12 12 12 12 12 12 5 ...
  .. .. .. ..$ organism_ch1            : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ characteristics_ch1     : Factor w/ 12 levels "Labeling_reference:BL 708",..: 11 12 6 12 12 12 12 12 12 5 ...
  .. .. .. ..$ characteristics_ch1.1   : Factor w/ 4 levels "Extraction_reference: PAXgene",..: 1 4 1 4 4 4 4 4 4 1 ...
  .. .. .. ..$ characteristics_ch1.2   : Factor w/ 15 levels "Sample_reference: BL 708",..: 11 13 6 14 13 15 14 15 14 5 ...
  .. .. .. ..$ characteristics_ch1.3   : Factor w/ 13 levels "Subject_reference: BL 708",..: 11 12 6 12 12 13 12 13 12 5 ...
  .. .. .. ..$ characteristics_ch1.4   : Factor w/ 4 levels "","Tissue: blood",..: 3 2 3 2 2 2 2 2 2 3 ...
  .. .. .. ..$ characteristics_ch1.5   : Factor w/ 3 levels "","Extraction_amount: 10.0",..: 3 3 3 3 3 2 3 2 3 3 ...
  .. .. .. ..$ characteristics_ch1.6   : Factor w/ 2 levels "","Extraction_amount: 10.0": 2 2 2 2 2 1 2 1 2 2 ...
  .. .. .. ..$ molecule_ch1            : Factor w/ 1 level "total RNA": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ extract_protocol_ch1    : Factor w/ 2 levels "Qiagen","Trizol": 1 2 1 2 2 2 2 2 2 1 ...
  .. .. .. ..$ label_ch1               : Factor w/ 1 level "Cy3, Cy5": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ label_protocol_ch1      : Factor w/ 1 level "Ambion": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ taxid_ch1               : Factor w/ 1 level "9606": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ source_name_ch2         : Factor w/ 22 levels "DA 706","FC 732",..: 19 16 19 4 8 17 5 1 2 19 ...
  .. .. .. ..$ organism_ch2            : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ characteristics_ch2     : Factor w/ 22 levels "Labeling_reference:DA 706",..: 19 16 19 4 8 17 5 1 2 19 ...
  .. .. .. ..$ characteristics_ch2.1   : Factor w/ 3 levels "Extraction_reference: L62 VN",..: 3 2 3 2 2 2 2 2 2 3 ...
  .. .. .. ..$ characteristics_ch2.2   : Factor w/ 25 levels "Sample_reference: DA 706",..: 21 16 21 4 8 17 5 1 2 20 ...
  .. .. .. ..$ characteristics_ch2.3   : Factor w/ 23 levels "Subject_reference: DA 706",..: 19 16 19 4 8 17 5 1 2 19 ...
  .. .. .. ..$ characteristics_ch2.4   : Factor w/ 2 levels "Tissue: blood",..: 1 1 1 2 1 2 2 2 2 2 ...
  .. .. .. ..$ characteristics_ch2.5   : Factor w/ 2 levels "Extraction_amount: 10.0",..: 2 2 2 2 2 1 2 1 2 2 ...
  .. .. .. ..$ characteristics_ch2.6   : Factor w/ 2 levels "","Extraction_amount: 10.0": 2 2 2 2 2 1 2 1 2 2 ...
  .. .. .. ..$ molecule_ch2            : Factor w/ 1 level "total RNA": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ extract_protocol_ch2    : Factor w/ 2 levels "Qiagen","Trizol": 2 1 2 1 1 1 1 1 1 1 ...
  .. .. .. ..$ label_ch2               : Factor w/ 1 level "Cy3, Cy5": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ label_protocol_ch2      : Factor w/ 1 level "Ambion": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ taxid_ch2               : Factor w/ 1 level "9606": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ hyb_protocol            : Factor w/ 1 level "Agilent : 750.0 ng at 60 degree_C during 17 hours": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ scan_protocol           : Factor w/ 1 level "Scanned on an GenePix 4000B fluorescent scanner.": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ scan_protocol.1         : Factor w/ 1 level "Image intensity data were extracted with GenePix Pro 6.0 analysis software.": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ description             : Factor w/ 18 levels "ejection fraction (EF): 20",..: 18 18 17 16 16 15 15 14 14 13 ...
  .. .. .. ..$ description.1           : Factor w/ 3 levels "group:  B","group: A",..: 3 3 3 3 3 3 3 3 3 3 ...
  .. .. .. ..$ data_processing         : Factor w/ 1 level "Lowess non linear normalization": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ platform_id             : Factor w/ 1 level "GPL1947": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ contact_name            : Factor w/ 1 level "Yvan,,Devaux": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ contact_email           : Factor w/ 1 level "yvan.devaux@lih.lu": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ contact_laboratory      : Factor w/ 1 level "Cardiovascular Research Unit": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ contact_institute       : Factor w/ 1 level "LIH": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ contact_address         : Factor w/ 1 level "120 route d'Arlon": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ contact_city            : Factor w/ 1 level "Luxembourg": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ contact_zip/postal_code : Factor w/ 1 level "1150": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ contact_country         : Factor w/ 1 level "Luxembourg": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ supplementary_file      : Factor w/ 32 levels "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM302nnn/GSM302309/suppl/GSM302309_L29921.gpr.gz",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. ..$ supplementary_file.1    : Factor w/ 32 levels "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM302nnn/GSM302309/suppl/GSM302309_L29923.gpr.gz",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. ..$ supplementary_file.2    : Factor w/ 32 levels "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM302nnn/GSM302309/suppl/GSM302309_L30105.gpr.gz",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. ..$ supplementary_file.3    : Factor w/ 32 levels "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM302nnn/GSM302309/suppl/GSM302309_L30107.gpr.gz",..: 1 2 3 4 5 6 7 8 9 10 ...
  .. .. .. ..$ data_row_count          : Factor w/ 1 level "16238": 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ Extraction_amount:ch1   : chr [1:32] "10.0" "10.0" "10.0" "10.0" ...
  .. .. .. ..$ Extraction_amount:ch2   : chr [1:32] "10.0" "10.0" "10.0" "10.0" ...
  .. .. .. ..$ Extraction_reference:ch1: chr [1:32] "PAXgene" "Trizol" "PAXgene" "Trizol" ...
  .. .. .. ..$ Extraction_reference:ch2: chr [1:32] "Trizol" "PAXgene" "Trizol" "PAXgene" ...
  .. .. .. ..$ Labeling_reference:ch1  : chr [1:32] "L88-TG" "Ref" "L38 DP" "Ref" ...
  .. .. .. ..$ Labeling_reference:ch2  : chr [1:32] "Ref" "L67-SR" "Ref" "KF 692" ...
  .. .. .. ..$ RNA_quality:ch1         : chr [1:32] "null" "null" "null" "null" ...
  .. .. .. ..$ RNA_quality:ch2         : chr [1:32] "null" "null" "null" "null" ...
  .. .. .. ..$ Sample_reference:ch1    : chr [1:32] "L88-TG" "Ref" "L38 DP" "REF" ...
  .. .. .. ..$ Sample_reference:ch2    : chr [1:32] "REF" "L67-SR" "REF" "KF 692" ...
  .. .. .. ..$ Subject_reference:ch1   : chr [1:32] "L88-TG" "Ref" "L38 DP" "Ref" ...
  .. .. .. ..$ Subject_reference:ch2   : chr [1:32] "Ref" "L67-SR" "Ref" "KF 692" ...
  .. .. .. ..$ Tissue:ch1              : chr [1:32] "Blood" "blood" "Blood" "blood" ...
  .. .. .. ..$ Tissue:ch2              : chr [1:32] "blood" "blood" "blood" "Blood" ...
  .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ featureData      :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 0 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr(0) 
  .. .. ..@ data             :'data.frame': 16238 obs. of  0 variables
  .. .. ..@ dimLabels        : chr [1:2] "featureNames" "featureColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ annotation       : chr "GPL1947"
  ..@ protocolData     :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
  .. .. ..@ varMetadata      :'data.frame': 0 obs. of  1 variable:
  .. .. .. ..$ labelDescription: chr(0) 
  .. .. ..@ data             :'data.frame': 32 obs. of  0 variables
  .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
  .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. .. .. ..@ .Data:List of 1
  .. .. .. .. .. ..$ : int [1:3] 1 1 0
  ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
  .. .. ..@ .Data:List of 4
  .. .. .. ..$ : int [1:3] 3 6 0
  .. .. .. ..$ : int [1:3] 2 44 0
  .. .. .. ..$ : int [1:3] 1 3 0
  .. .. .. ..$ : int [1:3] 1 0 0

RNA-Seq RNA geo R-language • 1.1k views

Hey Davide, I have never heard of that array, but perhaps you can get the annotation that you need from its main page on GEO (platform GPL1947)?

5.3 years ago by Kevin Blighe 88k

Thanks Kevin. I saw that page but I cannot understand how to access those data header fields. Do you know how I can do that?

5.3 years ago by Davide Chicco ▴ 120

score 0 · Accepted Answer · 2019-10-04

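I was able to solve my own problem by checking the documentation of the getGEO() function: the getGPL argument must be set to TRUE. With that setting, GEOquery also downloads the platform (GPL) annotation and stores it in the featureData slot of the ExpressionSet, which was empty (0 variables) in the str(gset) output above.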

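# Install GEOquery from Bioconductor if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!requireNamespace("GEOquery", quietly = TRUE))
    BiocManager::install("GEOquery")

library(GEOquery)

GSE_code <- "GSE11947"
# Optional: download the raw supplementary files for the series
getGEOSuppFiles(GSE_code)
# getGPL = TRUE also downloads the platform (GPL) annotation
gset <- getGEO(GSE_code, GSEMatrix = TRUE, getGPL = TRUE)
# With GSEMatrix = TRUE, getGEO() returns a list of ExpressionSets (one per platform)
gset <- gset[[1]]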
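
Once the platform annotation is attached, the remaining step is to line up each probe with its annotation row. Below is a minimal sketch of that join in base R, run on toy data so it works without downloading anything. The helper name annotate_expression, the probe IDs, the GENE_SYMBOL column, and the values are all made up for illustration; the actual column names for GPL1947 should be checked with colnames(fData(gset)).

```r
# Join a probe-level expression matrix to a platform annotation table.
# expr   : numeric matrix with probe IDs as row names (e.g. exprs(gset))
# annot  : data.frame with one row per probe (e.g. fData(gset))
# id_col : name of the column in annot that holds the probe IDs
annotate_expression <- function(expr, annot, id_col = "ID") {
  stopifnot(id_col %in% colnames(annot))
  # Position of each expression row in the annotation table (NA if absent)
  idx <- match(rownames(expr), annot[[id_col]])
  out <- data.frame(annot[idx, , drop = FALSE], expr, check.names = FALSE)
  rownames(out) <- NULL
  out
}

# Toy data standing in for the real objects (hypothetical IDs and symbols)
toy_expr <- matrix(c(0.1, -0.4, 1.2, 0.3, 0.0, -0.9), nrow = 3,
                   dimnames = list(c("p1", "p2", "p3"), c("GSM_a", "GSM_b")))
toy_annot <- data.frame(ID = c("p3", "p1", "p2"),
                        GENE_SYMBOL = c("GENE_C", "GENE_A", "GENE_B"),
                        stringsAsFactors = FALSE)

annotated <- annotate_expression(toy_expr, toy_annot)
print(annotated)
```

With the real data, the call would presumably be annotate_expression(exprs(gset), fData(gset)) after library(Biobase). GEO platform tables normally carry an ID column, but which column holds the gene symbols depends on how the platform was submitted.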