Question

Extracting data from Biogrid mitab format

0

Entering edit mode

9.2 years ago

stackf01 ▴ 20

Hello there. I have downloaded bulk of PPI data from BioGRID in the current release section. It is in PSI - MITAB format. I have downloaded this data to add in msql database. My question is how to I get this data which is in PSI -MITAB format in a proper format to integrate the data in MySQL database?

biogrid • 3.4k views

ADD COMMENT • link updated 9.2 years ago by Jean-Karim Heriche 27k • written 9.2 years ago by stackf01 ▴ 20

score 1 · Answer 1 · 2015-09-25

1

Entering edit mode

9.2 years ago

Jean-Karim Heriche 27k

Here is the perl code I use to put PPI into an undirected graph. It is only concerned with extracting protein IDs and Pubmed refs but you can easily modify it to get the rest.

  use Graph::Undirected;
  my $G = Graph::Undirected->new();
  my %notfound;
  open (FH,"<",$BIOGRID) or die "Can't open file $BIOGRID: $!\n";
  while (my $line=<FH>) {
    chomp($line);
    my @tmpary = split(/\t/,$line);
    my @P1s = split(/\|/,$tmpary[2]);
    my ($p1) = $P1s[0]=~/:(.+)/;
    my @P2s = split(/\|/,$tmpary[3]);
    my ($p2) = $P2s[0]=~/:(.+)/;
    next if (!$p1 || !$p2 || $notfound{$p1} || $notfound{$p2});
    my $EnsemblID1 = &get_EnsemblID($p1); # This is a function to convert to Ensembl IDs
    my $EnsemblID2 = &get_EnsemblID($p2);
    if ($EnsemblID1 && $EnsemblID2) {
      $G->add_edge($EnsemblID1,$EnsemblID2);
      if ($tmpary[8]) {
        my ($pmid) = $tmpary[8];
        $G->set_edge_attribute($EnsemblID1,$EnsemblID2,'pubmed',$pmid);
      }
    }
    else {
      if (!$EnsemblID1) {
        $notfound{$p1}++;
      }
      if (!$EnsemblID2) {
        $notfound{$p2}++;
      }
    }
  }
  close FH;

ADD COMMENT • link 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Thanks for your answer :)

I don't have any experience in Perl. So I cannot understand much what you did there. Can you elaborate further ? :/

I have downloaded BioGRID data in the mitab format. How do I put all this data in a table of MySQL database ?

ADD REPLY • link 9.2 years ago by stackf01 ▴ 20

0

Entering edit mode

The script processes the MITAB file line by line, splitting each line on tab characters into an array of cells. The official gene symbol of the interacting proteins are in cells with indices 2 and 3. These cells can potentially contain multiple alternate entries separated by '|' so their content is split then the gene symbol is extracted from the first entry with regular expression simply as all characters after ':'. Then I convert the gene symbols to Ensembl gene IDs (for a specific version of Ensembl) using a custom get_EnsemblID function (which internally uses the Ensembl perl API) and put the results as nodes of an undirected graph data structure. You can accomplish the same thing with your favorite programming language.
As for putting this into a MySQL database, it depends on the structure of the database. The basic SQL query to use is INSERT INTO ppi_table (ID1,ID2) VALUES($p1,$p2) which you would use with the perl DBI and DBD::mysql modules.

ADD REPLY • link 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Do you have a similar script in php or java please ?

ADD REPLY • link 9.2 years ago by stackf01 ▴ 20

0

Entering edit mode

No. Sorry. But php's syntax is very similar to perl's (it's derived from perl) so you should be able to easily adapt perl code to php.

ADD REPLY • link 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Another enlightenment !

The file that I have downloaded, is like a table structure? with rows and columns?

ADD REPLY • link 9.2 years ago by stackf01 ▴ 20

0

Entering edit mode

Yes. If you need all the data in it you could import it directly into MySQL. First create a table with as many columns as there are in the file (name them accordingly), then you can load the data from the mysql client with LOAD DATA INFILE 'PPI.mitab' INTO TABLE ppi. Check the MySQL manual for this. Also for fast retrieval, you need to enable indexing on the columns you want to query. However, the problem with this approach is that cells of the table will have multiple entries (separated by '|') which is not good database design and may interfere with efficient use of the index.

ADD REPLY • link 9.2 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

So basically you get the idea what I intend to make. My webpage will fetch this data from MySQL and pass it to JSON to be read in Cytoscape and display me the protein interaction network.

ADD REPLY • link 9.2 years ago by stackf01 ▴ 20