I downloaded a set of summaries about gene expression data sets from NCBI (for an example of the file format, go to https://www.ncbi.nlm.nih.gov/gds, search for GDS5879, and download the result through Send to file -> Summary (text)).
When you open this in Excel, the result looks like this:
Pulmonary CDC11c+ cells from young and middle-age animals
Analysis of pulmonary CDC11c+ cells from 6-8 week and 10-13 month old C57BL/6 animals. CDC11c+ cells are key modulators of the immune response in the lung. Results provide insight into molecular mechanisms underlying the decline in immune function associated with aging.
Organism: Mus musculus
Type: Expression profiling by array, count, 2 age sets
Platform: GPL6885 Series: GSE71868 8 Samples
FTP download: GEO ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS5nnn/GDS5879/
DataSet Accession: GDS5879 ID: 5879
Normal liver developmental stages: embryonic and postnatal
Analysis of hepatoblasts, immature hepatocytes and hepatocytes from livers at different developmental timepoints (embryonic day 14, embryonic day 18, post-natal day 5, post-natal day 56). Results provide insight into molecular mechanisms underlying normal liver development.
Organism: Mus musculus
Type: Expression profiling by array, log2 ratio, 4 age, 3 cell type, 2 development stage sets
Platform: GPL7202 Series: GSE65063 11 Samples
FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS5nnn/GDS5818/
DataSet Accession: GDS5818 ID: 5818
I wanted to parse this data so I have a table of Title + "\t" + Description + "\t" + Organism + "\t" + Type + "\t" + Platform + "\t" + Series + "\t" + NumberOfSamples + "\t" + DatasetAccession.
The file doesn't look badly laid out here, but when I open it in Excel, there are problems. For example, when my script assumes that "Organism" is the line underneath the experiment description, it sometimes prints the line below that (Type:) instead, even though the Organism line is there.
This is the code I tried:
    import sys

    count = 1  # number of the next record title to look for ("1.", "2.", ...)
    file_open = open(sys.argv[1])
    Dict1 = {}
    for line in file_open:
        # A title line is expected to contain the record number followed by "."
        if str(count) + "." in line:
            Title = line.strip()
            # The remaining fields are assumed to follow the title in a fixed order
            Description = next(file_open).strip()
            Dict1[Title] = [Description]
            Organism = next(file_open).strip().split(":")
            Dict1[Title].append(Organism[1].strip())
            Type = next(file_open).strip()
            split_type = Type.split(",")
            Dict1[Title].extend(split_type)
            Platform = next(file_open).strip()
            split_platform = Platform.split()
            Dict1[Title].extend(split_platform)
            Download = next(file_open).strip()  # FTP line: read to advance, not stored
            Dataset = next(file_open).strip().split()
            Dict1[Title].extend(Dataset)
            count += 1
    file_open.close()

    # One tab-delimited line per record
    for i in Dict1:
        print(i + "\t" + "\t".join(Dict1[i]))
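One thing worth checking in the script above: the test str(count) + "." in line is a substring test, so it is true for any line that contains those digits followed by a period anywhere, not only for a numbered title. Below is a minimal sketch of a stricter check, assuming title lines in the file start with the record number, a period, and a space. The helper name is_title_line and the example lines are hypothetical.

```python
import re


def is_title_line(line, count):
    """Return True only if the line starts with '<count>. ' (leading spaces allowed)."""
    # re.match anchors at the start of the line, unlike the 'in' substring test.
    return re.match(r"\s*{}\.\s".format(count), line) is not None


# Hypothetical lines, for illustration only
title = "1. Pulmonary CDC11c+ cells from young and middle-age animals"
other = "Cells were assayed at 37.5 degrees"

print(is_title_line(title, 1))   # numbered title for record 1
print("7." in other)             # substring test also fires on "37.5"
print(is_title_line(other, 7))   # anchored test does not
```

Whether this explains the skipped lines depends on what the real file contains, so it is only a possible cause to rule out.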
I also tried using csv and pandas to read in the file properly.
The problem is that the output is not uniform.
For example, if I just do this:
    import sys

    count = 1
    file_open = open(sys.argv[1])
    Dict1 = {}
    for line in file_open:
        if str(count) + "." in line:
            Title = line.strip()
            Description = next(file_open).strip()
            Dict1[Title] = [Description]
            Organism = next(file_open).strip().split(":")
            Dict1[Title].append(Organism[1].strip())
            Type = next(file_open).strip()
            split_type = Type.split(",")
            print(split_type)
The output looks like this:
Type: Expression profiling by array
Platform: GPL1261 Series: GSE32334 6 Samples
Type: Expression profiling by array
Platform: GPL8321 Series: GSE42389 12 Samples
Platform: GPL570 Series: GSE19533 10 Samples
Type: Expression profiling by array
Type: Expression profiling by array
Type: Expression profiling by array
Platform: GPL7202 Series: GSE22828 84 Samples
Type: Expression profiling by array
Platform: GPL1261 Series: GSE6290 37 Samples
Platform: GPL1261 Series: GSE22616 24 Samples
Type: Expression profiling by array
Platform: GPL1261 Series: GSE10113 12 Samples
Type: Expression profiling by array
Platform: GPL339 Series: GSE9914 12 Samples
Platform: GPL1261 Series: GSE8091 16 Samples
Type: Expression profiling by array
Type: Expression profiling by array
Platform: GPL96 Series: GSE9714 8 Samples
Platform: GPL96 Series: GSE9713 18 Samples
Platform: GPL96 Series: GSE9712 12 Samples
Platform: GPL96 Series: GSE6011 37 Samples
Type: Expression profiling by array
Type: Expression profiling by array
Type: Expression profiling by array
...even though the "Type:" line is there for the records that are skipped. I've also tried viewing the results as "Summary -> text" on NCBI, but that's even worse when I put it into Excel.
Can anyone help me cleanly parse this data? Have other people had this problem?
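One way to avoid depending on line position is to look up each field by its label. Below is a minimal sketch, not a tested solution for the real download. It assumes every record ends with a line starting with "DataSet Accession:", that the first two non-empty lines of a record are the title and the description, and that titles may carry a leading "N. " number (as the script above expects). The names FIELDS, parse_record and parse_summary are hypothetical.

```python
import re

# Labels as they appear in the sample records above, in output column order.
FIELDS = [
    ("Organism", re.compile(r"^Organism:\s*(.+)$", re.M)),
    # Stop at "Platform:" in case Type and Platform share one line.
    ("Type", re.compile(r"^Type:\s*(.+?)(?:\s+Platform:.*)?$", re.M)),
    ("Platform", re.compile(r"Platform:\s*(\S+)")),
    ("Series", re.compile(r"Series:\s*(\S+)")),
    ("NumberOfSamples", re.compile(r"(\d+)\s+Samples")),
    ("DatasetAccession", re.compile(r"DataSet Accession:\s*(\S+)")),
]


def parse_record(lines):
    """Turn the non-empty lines of one record into a list of column values."""
    title = re.sub(r"^\d+\.\s+", "", lines[0])  # drop a leading "N. " if present
    description = lines[1] if len(lines) > 1 else ""
    rest = "\n".join(lines[2:])  # labelled lines only, so labels in prose are ignored
    row = [title, description]
    for _name, pattern in FIELDS:
        match = pattern.search(rest)
        row.append(match.group(1).strip() if match else "")
    return row


def parse_summary(text):
    """Split the summary text into records and parse each one."""
    rows, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records carry no information
        current.append(line)
        if line.startswith("DataSet Accession:"):  # assumed last line of a record
            rows.append(parse_record(current))
            current = []
    return rows
```

Each returned row can then be written out with "\t".join(row). A missing label gives an empty column instead of shifting the later fields, which makes irregular records easier to spot.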
Thanks. Yes, I have been trying two other ways. One is a script:
The output:
Just a small question: I think the "IdList" in the above output has all the info I need, but I'm struggling to find a dictionary on NCBI that links each of these numbers to e.g. GDS5879, GSMXXXXX, ExpressionProfilingByArray, or GPLXXXX. Am I right in saying that such a file should exist somewhere, or that I should somehow be able to link those numbers to English? (For example, I know that 301847070 is linked somewhere to sample GSM1847070.)
and the other way, using the command:
The output:
Neither gives me what I want (which would be the following info for GDS5879 in a tab-delimited line):
Title (In the XML file from efetch, this is in DocumentSummary -> title) = Analysis of pulmonary CDC11c+ cells from 6-8 week and 10-13 month old C57BL/6 animals. CDC11c+ cells are key modulators of the immune response in the lung. Results provide insight into molecular mechanisms underlying the decline in immune function associated with aging.
Organism (DocumentSummary ->taxon) = Mus musculus
Type (e.g. RNASeq or microarray) (DocumentSummary ->gdsType) = Expression profiling by array
Platform (DocumentSummary ->GPL) = GPL6885
Series (DocumentSummary ->GSE) = GSE71868
NumberOfSamples (DocumentSummary ->n_samples) = 8
DatasetAccession (DocumentSummary ->GDS) = GDS5879
I'm wondering if someone could put me on the right track as to what I'm doing wrong with either of these approaches. I can pretty much see the data I want in both methods manually, but I'm struggling to turn that into code that works.
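To make the target concrete, here is a minimal sketch of turning docsum XML into tab-delimited lines with Python's standard xml.etree.ElementTree module. The element names (title, taxon, gdsType, GPL, GSE, n_samples, GDS) are the ones listed above. The function name docsums_to_rows is hypothetical, and the sketch assumes those elements are direct children of each DocumentSummary, so real efetch output may need adjusting (values may also be formatted differently from the list above).

```python
import xml.etree.ElementTree as ET

# Element names taken from the field list above, in output column order.
FIELDS = ["title", "taxon", "gdsType", "GPL", "GSE", "n_samples", "GDS"]


def docsums_to_rows(xml_text):
    """Return one list of field values per DocumentSummary element."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.iter("DocumentSummary"):
        # findtext looks at direct children only; a missing element becomes "".
        rows.append([(doc.findtext(name) or "").strip() for name in FIELDS])
    return rows


def rows_to_tsv(rows):
    """Join each row with tabs, one record per line."""
    return "\n".join("\t".join(row) for row in rows)
```

With the XML saved from efetch -format docsum in a file, reading that file into a string and passing it to docsums_to_rows would give one row per data set.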
I decided to try to be more adventurous and extract the sample characteristics for each dataset. However, eFetch doesn't seem to return all samples for a data set? For example, if you look at GDS5204, there are 41 samples in the data set.
When I run: esearch -db GDS -query GDS5204'[ACCN] AND GSM[ETYP]' | efetch -format docsum | xtract -pattern DocumentSummary -element Accession
The output has fewer than 41 samples:
GSM1303184 GSM1303183 GSM1303182 GSM1303181 GSM1303180 GSM1303179 GSM1303178 GSM1303177 GSM1303176 GSM1303175 GSM1303174 GSM1303173 GSM1303172 GSM1303171
If I put the number "5204" into a file called "test", and run: for i in `cat test`; do esearch -db GDS -query GDS$i'[ACCN] AND GSM[ETYP]' | efetch -format docsum | xtract -pattern DocumentSummary -element GDS Accession summary title; done
the output is:
5204 GSM1303184 106 years old Female 106 years old Female
5204 GSM1303183 105 years old Female 105 years old Female
5204 GSM1303182 104 years old Male 104 years old Male
5204 GSM1303181 103 years old Female 103 years old Female
5204 GSM1303180 94 years old Female 94 years old Female
5204 GSM1303179 93 years oldFemale 93 years oldFemale
5204 GSM1303178 92 years old Male 92 years old Male
5204 GSM1303177 91 years old Male 91 years old Male
5204 GSM1303176 91 years old Female 91 years old Female
When I run: esearch -db GDS -query GDS5204'[ACCN] AND GSM[ETYP]' | efetch -format docsum | xtract -pattern Samples -element Accession, there is no output at all.
Does this method/file not contain a complete list of samples for each data set, or am I misunderstanding something?