Hello there. I have downloaded bulk of PPI data from BioGRID in the current release section. It is in PSI - MITAB format. I have downloaded this data to add in msql database. My question is how to I get this data which is in PSI -MITAB format in a proper format to integrate the data in MySQL database?
Here is the perl code I use to put PPI into an undirected graph. It is only concerned with extracting protein IDs and Pubmed refs but you can easily modify it to get the rest.
use Graph::Undirected;
my $G = Graph::Undirected->new();
my %notfound;
open (FH,"<",$BIOGRID) or die "Can't open file $BIOGRID: $!\n";
while (my $line=<FH>) {
chomp($line);
my @tmpary = split(/\t/,$line);
my @P1s = split(/\|/,$tmpary[2]);
my ($p1) = $P1s[0]=~/:(.+)/;
my @P2s = split(/\|/,$tmpary[3]);
my ($p2) = $P2s[0]=~/:(.+)/;
next if (!$p1 || !$p2 || $notfound{$p1} || $notfound{$p2});
my $EnsemblID1 = &get_EnsemblID($p1); # This is a function to convert to Ensembl IDs
my $EnsemblID2 = &get_EnsemblID($p2);
if ($EnsemblID1 && $EnsemblID2) {
$G->add_edge($EnsemblID1,$EnsemblID2);
if ($tmpary[8]) {
my ($pmid) = $tmpary[8];
$G->set_edge_attribute($EnsemblID1,$EnsemblID2,'pubmed',$pmid);
}
}
else {
if (!$EnsemblID1) {
$notfound{$p1}++;
}
if (!$EnsemblID2) {
$notfound{$p2}++;
}
}
}
close FH;
The script processes the MITAB file line by line, splitting each line on tab characters into an array of cells. The official gene symbol of the interacting proteins are in cells with indices 2 and 3. These cells can potentially contain multiple alternate entries separated by '|' so their content is split then the gene symbol is extracted from the first entry with regular expression simply as all characters after ':'. Then I convert the gene symbols to Ensembl gene IDs (for a specific version of Ensembl) using a custom get_EnsemblID function (which internally uses the Ensembl perl API) and put the results as nodes of an undirected graph data structure. You can accomplish the same thing with your favorite programming language.
As for putting this into a MySQL database, it depends on the structure of the database. The basic SQL query to use is INSERT INTO ppi_table (ID1,ID2) VALUES($p1,$p2) which you would use with the perl DBI and DBD::mysql modules.
Yes. If you need all the data in it you could import it directly into MySQL. First create a table with as many columns as there are in the file (name them accordingly), then you can load the data from the mysql client with LOAD DATA INFILE 'PPI.mitab' INTO TABLE ppi. Check the MySQL manual for this. Also for fast retrieval, you need to enable indexing on the columns you want to query. However, the problem with this approach is that cells of the table will have multiple entries (separated by '|') which is not good database design and may interfere with efficient use of the index.
So basically you get the idea what I intend to make. My webpage will fetch this data from MySQL and pass it to JSON to be read in Cytoscape and display me the protein interaction network.
Thanks for your answer :)
I don't have any experience in Perl. So I cannot understand much what you did there. Can you elaborate further ? :/
I have downloaded BioGRID data in the mitab format. How do I put all this data in a table of MySQL database ?
The script processes the MITAB file line by line, splitting each line on tab characters into an array of cells. The official gene symbol of the interacting proteins are in cells with indices 2 and 3. These cells can potentially contain multiple alternate entries separated by '|' so their content is split then the gene symbol is extracted from the first entry with regular expression simply as all characters after ':'. Then I convert the gene symbols to Ensembl gene IDs (for a specific version of Ensembl) using a custom get_EnsemblID function (which internally uses the Ensembl perl API) and put the results as nodes of an undirected graph data structure. You can accomplish the same thing with your favorite programming language.
As for putting this into a MySQL database, it depends on the structure of the database. The basic SQL query to use is INSERT INTO ppi_table (ID1,ID2) VALUES($p1,$p2) which you would use with the perl DBI and DBD::mysql modules.
Do you have a similar script in php or java please ?
No. Sorry. But php's syntax is very similar to perl's (it's derived from perl) so you should be able to easily adapt perl code to php.
Another enlightenment !
The file that I have downloaded, is like a table structure? with rows and columns?
Yes. If you need all the data in it you could import it directly into MySQL. First create a table with as many columns as there are in the file (name them accordingly), then you can load the data from the mysql client with LOAD DATA INFILE 'PPI.mitab' INTO TABLE ppi. Check the MySQL manual for this. Also for fast retrieval, you need to enable indexing on the columns you want to query. However, the problem with this approach is that cells of the table will have multiple entries (separated by '|') which is not good database design and may interfere with efficient use of the index.
So basically you get the idea what I intend to make. My webpage will fetch this data from MySQL and pass it to JSON to be read in Cytoscape and display me the protein interaction network.