I am working on a protein-protein interaction (PPI) network clustering project. To test my results I want to download a protein-protein interaction data set from MIPS (Munich Information Center for Protein Sequences). I've downloaded data from http://mips.helmholtz-muenchen.de/proj/ppi/.
But it is in PSI-MI format. Can anyone help me extract the data set from this format? Also, how can I get the MIPS benchmark set? The old links return "not found". Please help me.
The format is XML, so you can parse it with whatever tool you normally use for parsing XML. Here is some Perl code I've been using:
use strict;
use warnings;
use XML::Simple;

my $MIPS_file = $ARGV[0];
my $xml  = XML::Simple->new();
my $data = $xml->XMLin($MIPS_file);
my $intList = $data->{'entry'}->{'interactionList'}->{'interaction'};

foreach my $int (@{$intList}) {
    # detection method of the experiment (extracted but not printed below)
    my $experiment_type = $int->{'experimentList'}->{'experimentDescription'}->{'interactionDetection'}->{'names'}->{'shortLabel'};
    my $partList = $int->{'participantList'}->{'proteinParticipant'};
    my ($p1, $p2);
    foreach my $protPart (@{$partList}) {
        # select human proteins (NCBI taxonomy ID 9606)
        if ($protPart->{'proteinInteractor'}->{'organism'}->{'ncbiTaxId'} eq "9606") {
            if (!$p1) {
                $p1 = $protPart->{'proteinInteractor'}->{'xref'}->{'primaryRef'}->{'id'};
            }
            else {
                $p2 = $protPart->{'proteinInteractor'}->{'xref'}->{'primaryRef'}->{'id'};
            }
        }
    }
    # print the pair tab-separated; skip interactions without two human proteins
    print "$p1\t$p2\n" if defined $p1 && defined $p2;
}
I am not sure what benchmark you're referring to. There used to be a file with protein complexes that many people used as a reference, but I don't think it is a good dataset for benchmarking protein interaction clustering algorithms because many large housekeeping complexes are overrepresented (e.g. ribosome, polymerases). As an alternative, you could use Reactome, which also has annotated protein complexes. Keep in mind, though, that what some biologists view as one complex will be seen as two complexes by others, so you should choose a reference dataset that matches the level of granularity you want to achieve with your clustering.
Many of the papers I have referred to for clustering use the MIPS data set, but the site they reference for the MIPS data is no longer reachable. So I downloaded data from http://mips.helmholtz-muenchen.de/proj/ppi/. I don't actually know whether it is the correct data. I also specifically need the PPI data set for Saccharomyces cerevisiae. Can you please help me with that?
The link you got points to mammalian protein data. You won't find yeast proteins in there. I think the MIPS data are not maintained anymore. I suggest you try an up-to-date, well-maintained database like IntAct. You can download the S. cerevisiae interactions from their FTP site.
Thank you very much for that information. Since I have to compare my results with some existing papers, and most of them use MIPS, DIP or BioGRID data, I went for those. I've downloaded the yeast data set from DIP, and the problem is that the DIP ID is there but not the common name or ORF name. I'm planning to compare my results with the CYC2008 benchmark, which uses the common name/ORF name for the proteins in its complexes. Could you please help me with converting DIP IDs to common names? Thank you once again for your help.
The files available in the download section of DIP should already have gene symbols and gene names. They also contain RefSeq and UniProt IDs, so you could collect these and use them as input to Ensembl's BioMart to get other names or identifiers not present in the files.
If you download the recently dated data, it doesn't have gene symbols. It has DIP IDs and RefSeq/UniProt IDs (not for every interaction). The only identifier present for every interaction is the DIP ID.
Are you sure you have the right file? Here is what the first S. cerevisiae interactions in the MITAB file look like (first two columns only):
ID interactor A ID interactor B
DIP-328N DIP-232N|uniprotkb:Q07812
DIP-1048N|refseq:NP_002871|uniprotkb:P04049 DIP-1043N|refseq:NP_000624|uniprotkb:P10415
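To collect the RefSeq and UniProt IDs from such a file for an ID-mapping tool, each interactor field can be split on "|" and each "database:accession" token on ":". Here is a minimal sketch in Python (not the Perl used elsewhere in this thread). It assumes the columns are tab-separated and that the header line starts with "ID interactor"; the function names are made up for illustration and are not part of any DIP tool.

```python
def parse_interactor_ids(field):
    """Split one MITAB interactor field, e.g.
    'DIP-1048N|refseq:NP_002871|uniprotkb:P04049', into {database: accession}."""
    ids = {}
    for token in field.strip().split("|"):
        if ":" in token:
            # 'uniprotkb:P04049' -> database 'uniprotkb', accession 'P04049'
            db, _, acc = token.partition(":")
            ids[db] = acc
        elif token.startswith("DIP-"):
            # the bare DIP node identifier carries no database prefix
            ids["dip"] = token
    return ids


def read_pairs(lines):
    """Return a list of (ids_A, ids_B) dicts from tab-separated MITAB lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        # skip blank lines and the header (assumed to start with 'ID interactor')
        if not line or line.startswith("ID interactor"):
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue
        pairs.append((parse_interactor_ids(cols[0]), parse_interactor_ids(cols[1])))
    return pairs


def uniprot_accessions(pairs):
    """Collect the distinct UniProt accessions found in either column."""
    found = set()
    for a, b in pairs:
        for ids in (a, b):
            if "uniprotkb" in ids:
                found.add(ids["uniprotkb"])
    return sorted(found)
```

With a real file you would call `read_pairs(open(path))` and paste the result of `uniprot_accessions` into the mapping tool. Interactors that carry only a DIP ID are kept in the pairs but contribute nothing to the accession list, so they still need a separate lookup.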
I tried this code, but I am not getting any output. The argument is the DIP file, which is of type .mif25.
use strict;
use warnings;
use XML::Simple;

my $DIP_file = $ARGV[0];
my $xml  = XML::Simple->new();
my $data = $xml->XMLin("$DIP_file");
my $intList = $data->{'entrySet'}->{'entry'}->{'interactorList'}->{'interactor'};
print $intList;
foreach my $int (@{$intList}) {
    print $int->{'names'}->{'shortLabel'}->text;
}
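Two things in the Perl attempt above may explain the empty output: by default XMLin discards the root element (the KeepRoot option is off), so 'entrySet' would not be a key in $data, and ->text is not a method on the plain data structure that XML::Simple returns. As an alternative, here is a minimal sketch in Python rather than Perl that lists the interactor short labels with the standard library's ElementTree. It matches elements by local name so that it does not depend on the exact namespace URI. The sample document and the PROT_A/PROT_B labels are invented for illustration and are not real DIP content.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def find_child(elem, name):
    """Return the first direct child whose local name equals `name`, or None."""
    for child in elem:
        if local(child.tag) == name:
            return child
    return None


def interactor_labels(xml_text):
    """Map each <interactor> id attribute to its names/shortLabel text."""
    root = ET.fromstring(xml_text)
    labels = {}
    for elem in root.iter():
        if local(elem.tag) != "interactor":
            continue
        names = find_child(elem, "names")
        short = find_child(names, "shortLabel") if names is not None else None
        if short is not None and short.text:
            labels[elem.get("id")] = short.text.strip()
    return labels


# Invented sample with the element layout assumed in the question
# (entrySet > entry > interactorList > interactor > names > shortLabel).
SAMPLE = """<entrySet xmlns="net:sf:psidev:mi" level="2" version="5">
  <entry>
    <interactorList>
      <interactor id="1"><names><shortLabel>PROT_A</shortLabel></names></interactor>
      <interactor id="2"><names><shortLabel>PROT_B</shortLabel></names></interactor>
    </interactorList>
  </entry>
</entrySet>"""

for interactor_id, label in interactor_labels(SAMPLE).items():
    print(interactor_id, label, sep="\t")
```

For a real file, read its contents and pass them to `interactor_labels`. The resulting id-to-label map can then be used to translate the interactor references found in the interaction entries, assuming those refer to the same ids.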
The PSI-MI (XML) file has gene names e.g.:
I am using the tab files, which do not have gene names. I will now try the PSI-MI file. Thank you very much.
I tried the PSI-MI file with a Perl script, but it is not working. Could you please give me a script that parses only the DIP PPI interactions? Thank you.
Could you please help me?
Interactions are in the <interactionList> section. Try something like