How to get information for a series of accession IDs?
9.7 years ago
Mo ▴ 920

Hello,

I have a series of GEO accession IDs and I am wondering whether there is a fast way to extract the information related to each of them, instead of checking them one by one on the GEO website. For example, I want to know whether each sample is RNA-seq or something else, its tissue, its characteristics, etc.

The accessions are:

GSM1387801
GSM1387802
GSM1387803
GSM1387804
GSM1409334
GSM1409335

I also tried another approach: first importing all samples for a platform ID in R, based on the previous question "Querying Ncbi Geo By Platform Id". However, I am afraid this does not lead me to what I really want, and it does not seem to work properly. Let's take GPL17301, which consists of 53 samples, as my platform ID.

I did

library(GEOquery)
gpl <- getGEO("GPL17301")
length(Meta(gpl)$series_id)
# It only showed 10?!
python R unix linux • 2.8k views
9.7 years ago
Ram 44k

You can use the R package GEOquery to do this.

Ref: A: Extract Expression Profiles Of Specific Region From Geo

Protip: Always search existing questions before you start a question. Basic questions such as these are usually already addressed.


@Ram thanks for your message, but it seems you did not read my question. I am not looking to extract expression profiles. Please read my question carefully. If I can do that with GEOquery, can you please provide an example based on the given accession IDs? I read the manual, but I could not work out how to use it for this purpose.


In my experience, most APIs can be used to fetch records, and the data you need seems to be part of the record. APIs usually parse the record into accessible formats for programmable analysis.

I have no experience with GEOquery, but I was extrapolating from my experience and from common sense. I apologize if my answer was not specific to your query. I prefer showing people the path to taking them to their destination.

9.7 years ago
A. Domingues ★ 2.7k

From your code I can see that gpl does not contain the information you are looking for:

> str(gpl)
Formal class 'GPL' [package "GEOquery"] with 2 slots
  ..@ dataTable:Formal class 'GEODataTable' [package "GEOquery"] with 2 slots
  .. .. ..@ columns:'data.frame':    0 obs. of  0 variables
  .. .. ..@ table  :'data.frame':    0 obs. of  0 variables
  ..@ header   :List of 14
  .. ..$ contact_country : chr "USA"
  .. ..$ contact_name    : chr ",,GEO"
  .. ..$ data_row_count  : chr "0"
  .. ..$ distribution    : chr "virtual"
  .. ..$ geo_accession   : chr "GPL17301"
  .. ..$ last_update_date: chr "Jun 17 2013"
  .. ..$ organism        : chr "Homo sapiens"
  .. ..$ sample_id       : chr [1:53] "GSM1166038" "GSM1166039" "GSM1166040" "GSM1166041" ...
  .. ..$ series_id       : chr [1:10] "GSE46876" "GSE48033" "GSE49477" "GSE50057" ...
  .. ..$ status          : chr "Public on Jun 17 2013"
  .. ..$ submission_date : chr "Jun 17 2013"
  .. ..$ taxid           : chr "9606"
  .. ..$ technology      : chr "high-throughput sequencing"
  .. ..$ title           : chr "Ion Torrent PGM (Homo sapiens)"

You can, however, retrieve the IDs of all samples for that platform with:

library(GEOquery)
gpl <- getGEO("GPL17301")
head(gpl@header$sample_id)
[1] "GSM1166038" "GSM1166039" "GSM1166040" "GSM1166041" "GSM1166042"
[6] "GSM1166043"
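
A side note on the count of 10 in the question: that number comes from series_id, which lists the series (GSE) using the platform, while the samples (GSM) are listed in sample_id. A minimal sketch of the comparison follows. id_counts is a hypothetical helper name, not part of GEOquery, and it only assumes the header is a list with sample_id and series_id entries, as in the str() output above.

```r
# Hypothetical helper (not part of GEOquery): count the sample and series
# accessions listed in a platform header.
id_counts <- function(header) {
  c(samples = length(header$sample_id),
    series  = length(header$series_id))
}

# With the platform from the question (needs GEOquery and network access):
# library(GEOquery)
# gpl <- getGEO("GPL17301")
# id_counts(Meta(gpl))
```

Going by the str() output above, this should report 53 samples and 10 series for GPL17301, although the record may have changed since then.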

From there we can extract the information for each sample:

samples <- gpl@header$sample_id
gps <- getGEO(samples[1])

With another str I figured out where the information regarding the library prep (and anything else) is:

str(gps)
gps@header$library_strategy
[1] "RNA-Seq"

Two notes:

  • It will take some digging because not all records follow the same rules. They should, but I did experience some inconsistencies trying to find information in the past. Since this appears to have been done for the same entity you might be lucky and won't need to resort to greps.
  • str() is my best fRiend.

@fridaymeetssunday Thanks for your example; however, there is one problem. The platform tells you about RNA-Seq in general, but I need this for each of my samples, not for all samples on the platform. I also want to know, for example, which tissue each sample comes from, plus some more information. My main question is how to extract all of this simply. If you have any clue, I would really appreciate your help.


Maybe I misunderstood what you want, but if you run str() as indicated in my example, you will see the structure of the information for that sample (and it should be the same for all samples). Then, as in my previous code, you will see that the tissue information can be obtained with gps@header$characteristics_ch1. More specifically, gps@header$characteristics_ch1[1] tells you:

[1] "tissue: Pooled Tumor"

Other information you need can be traced with str().
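
If the characteristics follow the "key: value" pattern shown above, they can be split into a named vector so that fields like the tissue can be looked up by name. This is only a sketch: parse_characteristics is a hypothetical helper, not a GEOquery function, and it assumes each entry uses a colon as the separator, which not every record will do.

```r
# Hypothetical helper (not part of GEOquery): turn "key: value" strings such
# as "tissue: Pooled Tumor" into a named character vector.
parse_characteristics <- function(x) {
  x <- as.character(x)
  has_key <- grepl(":", x, fixed = TRUE)
  # Split on the first colon only, so values may themselves contain colons.
  # Entries without a colon keep their text as the value and get an NA name.
  keys   <- ifelse(has_key, trimws(sub(":.*$", "", x)), NA_character_)
  values <- ifelse(has_key, trimws(sub("^[^:]*:", "", x)), trimws(x))
  stats::setNames(values, keys)
}

parse_characteristics("tissue: Pooled Tumor")[["tissue"]]
# [1] "Pooled Tumor"
```

For a real sample you would pass gps@header$characteristics_ch1 instead of the literal string.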


@fridaymeetssunday thanks! But this is only for one sample. If I do it this way, it is the same as using the website. I am thinking of an automatic way to extract all the info for all samples instead.


I saw your post on the Bioconductor forum and you are almost there. I suggest you replace the lapply with a for loop. Now it is time to put your R skills to use.
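
For reference, a rough sketch of what such a loop could look like. The helper names fetch_header and summarise_samples are made up for illustration. The field names source_name_ch1 and organism_ch1 are my assumption of standard GEO sample fields (only library_strategy and characteristics_ch1 appear above), so check them against str() for your own samples. Since each getGEO() call is a network request, the choice between a for loop and lapply is unlikely to matter much for speed.

```r
# Hypothetical helper: fetch one GEO record and return its header as a list.
# Assumes the GEOquery package is installed and NCBI GEO is reachable.
fetch_header <- function(accession) {
  GEOquery::Meta(GEOquery::getGEO(accession))
}

# Hypothetical helper: collect selected header fields for several samples
# into one data frame (one row per accession). Multi-valued fields such as
# characteristics_ch1 are collapsed with "; "; missing fields become NA.
summarise_samples <- function(accessions,
                              fields = c("library_strategy", "source_name_ch1",
                                         "organism_ch1", "characteristics_ch1"),
                              fetch = fetch_header) {
  rows <- vector("list", length(accessions))
  for (i in seq_along(accessions)) {
    header <- fetch(accessions[i])
    values <- vapply(fields, function(f) {
      v <- header[[f]]
      if (is.null(v)) NA_character_ else paste(v, collapse = "; ")
    }, character(1))
    rows[[i]] <- data.frame(accession = accessions[i], t(values),
                            stringsAsFactors = FALSE)
  }
  do.call(rbind, rows)
}

# Usage with the accessions from the question (needs network access):
# ids <- c("GSM1387801", "GSM1387802", "GSM1387803",
#          "GSM1387804", "GSM1409334", "GSM1409335")
# info <- summarise_samples(ids)
```

The fetch argument is there only so the function can be tried with a stand-in that returns a plain list, without hitting GEO.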


@fridaymeetssunday why should I use a loop? If your data is huge, a loop is the worst thing you can do, since it takes more time to run; normally people avoid loops :-p Sean Davis suggested using GEOmetadb, but I have no idea how it works, and it is such a complicated package with no exact example (at least for me)! I feel like I should write something myself.
