Question

Extracting Information From GEO SOFT Files


12.4 years ago

Layla ▴ 50

Hi: I am working with SOFT files obtained from GEO for specific diseases. Is there a specific tag or field inside a GEO SOFT file that indicates whether a sample (a GSM, in this case) belongs to the control group or the disease group? Can I obtain that information using an external package such as GEOquery? Thanks.

Updated 12.4 years ago by Neilfws 49k

score 2 · Answer 1 · 2012-07-18

It sounds like you should be searching for the GSE (or GDS) rather than the GSMs directly. The experiment-level information is in the GSE (or GDS), while the sample-level information is in the GSM records. GEOquery can handle all GEO data types, so I would recommend giving it a try before reinventing the wheel.
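To make the GSE/GSM relationship concrete, here is a minimal Python sketch (not GEOquery, and not part of the original answer) that groups the `!Sample_*` attributes of a GSE family SOFT file under each `^SAMPLE` entity. It relies only on the SOFT convention that `^` lines start an entity and `!` lines carry attributes; the function name `parse_soft_samples` and the GSM accessions in the demo text are hypothetical placeholders.

```python
from collections import defaultdict


def parse_soft_samples(lines):
    """Group '!Sample_*' attributes under each '^SAMPLE = GSM...' entity."""
    samples = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("^"):
            # Entity header, e.g. "^SAMPLE = GSM..." or "^SERIES = GSE4183"
            kind, _, accession = line[1:].partition("=")
            if kind.strip().upper() == "SAMPLE":
                current = samples.setdefault(accession.strip(), defaultdict(list))
            else:
                current = None  # series/platform attributes are ignored here
        elif current is not None and line.startswith("!"):
            # Attribute line; a key may repeat, so values are kept in a list
            key, _, value = line[1:].partition("=")
            current[key.strip()].append(value.strip())
    return samples


# Demo text: the GSM accessions below are placeholders, not real accessions.
example = """\
^SERIES = GSE4183
!Series_title = Inflammation, adenoma and cancer: objective classification of colon biopsy specimens with gene expression signature
^SAMPLE = GSM0000001
!Sample_title = colon_normal_1024
^SAMPLE = GSM0000002
!Sample_title = colon_adenoma_1115
""".splitlines()

samples = parse_soft_samples(example)
for gsm, attrs in samples.items():
    print(gsm, attrs["Sample_title"][0])
```

Each sample ends up as a dictionary of attribute lists, which is roughly the sample-level view that a package such as GEOquery gives you without the hand-written parsing.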

Ram · Answer 2 · 2012-07-18

The short answer to this question is no. You cannot determine whether a sample is control/normal or diseased from the GEO database, because it does not use a controlled vocabulary: in other words, submitters can describe their samples any way they like, using arbitrary free text. This is, in my opinion, one of the great failings of GEO.

Take for example series GSE4183: "Inflammation, adenoma and cancer: objective classification of colon biopsy specimens with gene expression signature." In this study, a sample from normal colon is described like this:

!Sample_title = colon_normal_1024
!Sample_description = total RNA extracted from cells obtained using biopsy of the colon in a healthy control

A sample from a colon adenoma:

!Sample_title = colon_adenoma_1115
!Sample_description = total RNA extracted from cells obtained using biopsy in a patient having colon adenoma

This is actually quite a good example. In other series, titles and descriptions frequently do not follow any pattern and are not informative.

So in summary: there is no specific tag for "control" or "disease". The best you can do is parse sample title and/or description (using GEOquery or some other software) and hope that the text is informative.
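As a rough illustration of "parse the title and description and hope", the Python sketch below applies keyword matching to the two GSE4183 samples quoted above. The keyword lists and the function name `guess_group` are assumptions made for this example; they would need to be adapted (and the results checked by eye) for every series, since GEO text is free-form.

```python
import re

# Assumed keyword lists: tune these per series, they are not a GEO standard.
CONTROL_PATTERN = re.compile(r"\b(control|normal|healthy)\b", re.IGNORECASE)
DISEASE_PATTERN = re.compile(r"\b(adenoma|cancer|carcinoma|tumou?r)\b", re.IGNORECASE)


def guess_group(title, description):
    """Return 'control', 'disease' or 'unknown' from free-text sample fields."""
    # Titles such as "colon_normal_1024" use underscores, which count as word
    # characters in regexes, so turn them into spaces before matching.
    text = f"{title} {description}".replace("_", " ")
    is_control = bool(CONTROL_PATTERN.search(text))
    is_disease = bool(DISEASE_PATTERN.search(text))
    if is_control and not is_disease:
        return "control"
    if is_disease and not is_control:
        return "disease"
    return "unknown"  # no keyword, or conflicting keywords: inspect manually


normal = guess_group(
    "colon_normal_1024",
    "total RNA extracted from cells obtained using biopsy of the colon in a healthy control",
)
adenoma = guess_group(
    "colon_adenoma_1115",
    "total RNA extracted from cells obtained using biopsy in a patient having colon adenoma",
)
print(normal, adenoma)
```

Returning "unknown" when both or neither keyword set matches is deliberate: a description like "normal tissue adjacent to tumor" should be reviewed by a person rather than guessed.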

Ram · Answer 3 · 2012-07-18

I don't know if this will help, but a few years ago we wrote these scripts to 1) extract data and notes from series_matrix files and 2) convert the generated .notes file into a phenoData file.

Usage (you need to install the PerlIO::gzip module first):

extractData.pl GSExxxx_series_matrix.txt.gz output_dir

This extracts a GSExxxx.data file and a GSExxxx.notes file into 'output_dir'. The GSExxxx.data file is easily imported into R with read.table(); its column names are the GSM accessions.
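For anyone not working in R, the same .data file can be read with a few lines of Python. This is only a sketch under the assumption that the file is tab-separated with an ID_REF column followed by one column per GSM, as the series matrix table is; the function name `read_expression_table`, the placeholder GSM accessions, and the numeric values in the demo are made up for illustration.

```python
import csv
import io


def to_number(text):
    """Convert a cell to float; empty or non-numeric cells become None."""
    try:
        return float(text)
    except ValueError:
        return None


def read_expression_table(handle):
    """Read a tab-separated table: header = ID_REF + GSM columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    sample_ids = header[1:]  # column names are the GSM accessions
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = [to_number(cell) for cell in row[1:]]
    return sample_ids, table


# Demo content with invented values; a real run would pass open("GSExxxx.data").
demo = io.StringIO(
    "ID_REF\tGSM0000001\tGSM0000002\n"
    "1007_s_at\t10.5\t11.25\n"
    "1053_at\t\t7.0\n"
)
sample_ids, table = read_expression_table(demo)
print(sample_ids)
print(table["1007_s_at"])
```

Missing or non-numeric cells are mapped to None here rather than raising an error, since series matrix tables can contain blanks.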

To get a phenoData file from the .notes file, run:

notes2pData.pl GSExxxx.notes

Hope it still works!

Julien

extractData.pl:

#!/usr/bin/perl -w

use strict;
use warnings;
use PerlIO::gzip;

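# Open the gzipped series matrix; stop with a clear message if it fails.
open( FILE, "<:gzip", $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";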

my $outputDir = $ARGV[1];

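# Derive the GSE (and optional GPL) accession from the file name.
# Dots are escaped, and a non-matching name is fatal rather than
# leaving $GSE undefined.
$ARGV[0] =~ /(GSE\d+)-?(GPL\d+)?_series_matrix\.txt\.gz$/
    or die "Unexpected file name: $ARGV[0]\n";
my $GSE = $1;
my $GPL = $2;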

my($prefix);
if(defined($GPL)) {
    $prefix = "$GSE-$GPL";
}
else {
    $prefix = "$GSE";
}

print $prefix;

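open(MATRIX, ">$outputDir/$prefix.data") or die "Cannot write $outputDir/$prefix.data: $!\n";
open(NOTES, ">$outputDir/$prefix.notes") or die "Cannot write $outputDir/$prefix.notes: $!\n";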

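# Scalars start as empty strings so that a series without, say, a PubMed ID
# does not trigger "uninitialized value" warnings when the notes are printed.
my ( $gseTitle, $gseDescription, $gsePMID ) = ( '', '', '' );
my ( @sampleIDs, @sampleTitles, @sampleDescriptions, @sampleSrcCh1,
    @samplePlatforms, @sampleOrganism, @platformIDs );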

my $table = 0;

while(my $line = <FILE>) {

    if($line =~ /^\![sS]eries_matrix_table_end/) {
        $table = 0;
    }
    if($table == 1) {
        $line =~ s/[\"\#]//g;
        print MATRIX $line;
    }


    $line =~ s/[\r\n]//g;

    if($line =~ /^\!Series_title[\s\t]+\"(.+)\"/) {
        $gseTitle = $1;
    }
    elsif($line =~ /^\!Series_summary[\s\t]+\"(.+)\"/) {
        $gseDescription .= $1;
    }
    elsif($line =~ /^\!Sample_geo_accession[\s\t]+\"(.+)\"/) {
        my $tt = $1;
        $tt =~ s/\"//g;
        @sampleIDs = split(/\t/, $tt);
    }
    elsif($line =~ /^\!Series_pubmed_id[\s\t]+\"(.+)\"/)  {
        $gsePMID = $1;
    }
    elsif($line =~ /^\!Series_platform_id[\s\t]+\"(.+)\"/) {
        push @platformIDs, $1;
    }
    elsif($line =~ /^\!Sample_title/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleTitles = @t;
    }
    elsif($line =~ /^\!Sample_source_name_ch1/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleSrcCh1 = @t;
    }
    elsif($line =~ /^\!Sample_organism_ch1/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @sampleOrganism = @t;
    }
    elsif($line =~ /^\!Sample_description/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        for(my $i = 0; $i < scalar(@t); $i++) {
            $sampleDescriptions[$i] .= $t[$i]." ";
        }
    }
    elsif($line =~ /^\!Sample_platform_id/) {
        $line =~ s/\"//g;
        my @t = split(/\t/, $line);
        shift @t;
        @samplePlatforms = @t;
    } 
    elsif($line =~ /^\![sS]eries_matrix_table_begin/) {
        $table = 1;
    }
}

#( $gseTitle, $gseDescription, $gsePMID, @sampleIDs, @sampleTitles,
#   @sampleDescriptions, @sampleSrcCh1, @samplePlatforms, @sampleOrganism,
#   @platformIDs )

print NOTES "GSE_ID = $GSE\n", "GSE_TITLE = $gseTitle\n", "GSE_DESC = $gseDescription\n", "GSE_PMID = $gsePMID\n";
if(defined($GPL)) {
    print NOTES "PLATFORM = $GPL\n";
}
else {
    print NOTES "PLATFORM = $samplePlatforms[0]\n";
}
print NOTES     "NB_SAMPLES = ".scalar(@sampleIDs)."\n";
print NOTES "\n";
print NOTES "SAMPLE_IDS = ".join("\t",@sampleIDs)."\n",
            "SAMPLE_TITLES = ".join("\t",@sampleTitles)."\n",
            "SAMPLE_ORGANISMS = ".join("\t", @sampleOrganism)."\n",
            "SAMPLE_SRC_CH1 = ".join("\t", @sampleSrcCh1)."\n",
            "SAMPLE_DESC = ".join("\t", @sampleDescriptions)."\n";

close(FILE);
close(MATRIX);
close(NOTES);

Ram · Answer 4 · 2012-07-18

and notes2pData.pl:

#!/usr/bin/perl -w

use strict;
use warnings;

open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";

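# No leading ^ anchor, so a path such as output_dir/GSExxxx.notes also matches.
$ARGV[0] =~ /(GSE\d+)-?(GPL\d+)?\.notes$/
    or die "Unexpected file name: $ARGV[0]\n";
my $GSE = $1;
my $GPL = $2;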

my($prefix);
if(defined($GPL)) {
        $prefix = "$GSE-$GPL";
}
else {
        $prefix = "$GSE";
}

print $prefix;

my(@ids);
my(%hash);
my $go = 0;

while(my $line = <FILE>) {
    $line =~ s/[\r\n]//g;
    print "+";
    if($line =~ /^SAMPLE\_IDS\s=\s(.+)$/) {
        @ids = split(/\t/,$1);
        print join("\t",@ids)."\n";     
        $go = 1;
    }

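    # Store only lines that really look like "KEY = tab-separated values".
    # An unchecked match would silently reuse stale $1/$2 from an earlier match.
    if($go == 1 && $line =~ /^(.+?)\s=\s(.+)$/) {
        my @t = split(/\t/,$2);
        print join("\t",@t)."\n";
        $hash{$1} = \@t;
    }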
}

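open(PDATA,">$prefix.pData") or die "Cannot write $prefix.pData: $!\n";

# Sort the keys so that the column order is the same on every run.
my @keys = sort keys(%hash);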
print PDATA "ID\t".join("\t",@keys)."\n";
for(my $i=0;$i<scalar(@ids);$i++) {
    print PDATA $ids[$i];
    for(my $j=0;$j<scalar(@keys);$j++) {
        # split() drops trailing empty fields, so pad missing values with "".
        print PDATA "\t", $hash{$keys[$j]}[$i] // "";
    }
    print PDATA "\n";
}

close(PDATA);
close(FILE);