I recently needed to automate a BioMart query and found the Perl API the most convenient way to do it. I believe BioMart will eventually move away from (or refactor) the Perl API but, until then, it seems to be the most convenient way to access BioMart programmatically.
This sample script queries the InterPro BioMart website for details corresponding to an InterPro accession. A sample Perl snippet from the BioMart website was used as a starting point. The result is a list of UniProtKB protein accessions and other details for the provided InterPro accession, after several filters are applied. Almost any query you can construct on the BioMart website can be run this way: simply click on the 'Perl' button to see how the query lines would need to be changed. The script below should help with some issues that are not explained in the provided code snippets or the (non-existent) documentation for the Perl API, including how to handle timeout errors, how to turn result counting on and off, and how to redirect output from STDOUT to a file.
You must have biomart-perl installed for this script to work. It can be downloaded from: http://www.biomart.org/other/install-overview.html. See the section titled "1.2 Downloading biomart-perl" for the CVS commands to run and "1.4 Installing biomart-perl" for installation instructions. A number of dependencies were missing during my installation, but the following code worked without resolving them. Results may vary; ideally you will want root access, or have your system admin install any missing dependencies.
A registry file must also be provided. This can be obtained from: http://www.biomart.org/biomart/martservice?type=registry. Copy this into a file and then delete all entries except those corresponding to INTERPRO and UNIPROT (or whichever database(s) you intend to query). This last step reduces the amount of time required to load all registries.
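The trimming step can also be scripted. The following is only a rough sketch: it assumes each mart location entry in the registry sits on a single line and carries a name="..." attribute, and the sample registry text, host names and the filter_registry helper are made up for illustration. Check the layout of your own downloaded registry before relying on it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only location entries whose name attribute
# contains one of the wanted mart names (case-insensitive).
# Assumes one location element per line, which may not hold for every registry.
sub filter_registry {
    my ($xml, @wanted) = @_;
    my $pattern = join '|', map { quotemeta($_) } @wanted;
    my @kept;
    for my $line (split /\n/, $xml) {
        # Lines that are not location entries (XML header, root tags) pass through
        if ($line =~ /<\w*Location\b/
            && $line !~ /\bname="[^"]*(?:$pattern)[^"]*"/i) {
            next;
        }
        push @kept, $line;
    }
    return join("\n", @kept) . "\n";
}

# Made-up sample registry, for illustration only
my $sample = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<MartRegistry>
  <MartURLLocation name="ensembl" displayName="EXAMPLE A" host="example.org" />
  <MartURLLocation name="interpro" displayName="EXAMPLE B" host="example.org" />
  <MartURLLocation name="uniprot_mart" displayName="EXAMPLE C" host="example.org" />
</MartRegistry>
XML

print filter_registry($sample, 'interpro', 'uniprot');
```

In practice you would read the downloaded registry from a file and write the filtered text to the path given in $confFile.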
Note regarding timeout errors: if queries take too long to complete and you receive timeout errors, find the line $ua->timeout(20); in ~/biomart-perl/lib/BioMart/Configuration/URLLocation.pm (or wherever you installed biomart-perl) and increase the value to 180, i.e. $ua->timeout(180);. I have also written the script below so that it automatically retries queries until they succeed.
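Retrying forever can hang a script if the server never answers or if a query legitimately returns nothing. As a hedged alternative, here is a minimal sketch of a bounded retry. The retry helper and the flaky stand-in query are hypothetical; it assumes a failed attempt shows up either as an exception or as a false return value, which I have not verified for every BioMart failure mode.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run $action up to $max_attempts times.
# An attempt counts as failed if it dies or returns a false value.
sub retry {
    my ($action, $max_attempts, $delay) = @_;
    for my $attempt (1 .. $max_attempts) {
        my $result = eval { $action->() };
        warn "Attempt $attempt failed: $@" if $@;
        return $result if $result;
        # Pause between attempts, but not after the last one
        sleep($delay) if $delay && $attempt < $max_attempts;
    }
    die "Giving up after $max_attempts attempts\n";
}

# Stand-in for a flaky query: fails twice, then succeeds
my $calls = 0;
my $count = retry(sub {
    $calls++;
    die "timeout\n" if $calls < 3;
    return 42;
}, 5, 0);

print "Got $count after $calls calls\n";
```

In the script below, the action passed to retry would wrap the calls to $query_runner->execute($query) and $query_runner->getCount(). Note that a genuine count of zero is treated as a failure here, so it would use up all attempts before giving up.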
#!/usr/bin/perl
use strict;
use warnings;
use lib "$ENV{HOME}/biomart-perl/lib"; #Set this to path where you installed biomart-perl (Perl does not expand '~' in paths)
use BioMart::Initializer;
use BioMart::Query;
use BioMart::QueryRunner;
my $confFile = "$ENV{HOME}/biomart-perl/conf/biomart_Interpro_registry.xml"; #Set this to path where you downloaded registry file (Perl does not expand '~' in paths)
my $tempfile = "biomart_query_temp.txt";
#Note: change action to 'clean' if you wish to start a fresh configuration
#Set to 'cached' if you want to skip configuration step on subsequent runs from the same registry
my $action='cached';
my $initializer = BioMart::Initializer->new('registryFile'=>$confFile, 'action'=>$action);
my $registry = $initializer->getRegistry;
For this example we will query the UniProt BioMart with a single InterPro query term and filter down to only proteins: (1) in "The complete human proteome", see http://www.uniprot.org/faq/48; (2) with Swiss-Prot (reviewed) status, see http://www.uniprot.org/faq/7; (3) with evidence at protein level, see http://www.uniprot.org/docs/pe_criteria. For output, we will retrieve the UniProt accession, ID, protein name and gene name, along with the protein evidence and entry type used in the filters.
my $queryterm="IPR000022";
print "\nAttempting UniProt list query for $queryterm\n";
my $query = BioMart::Query->new('registry'=>$registry,'virtualSchemaName'=>'default');
$query->setDataset("uniprot");
$query->addFilter("interpro_id", [$queryterm]);
$query->addFilter("proteome_name", ["Homo sapiens"]);
$query->addFilter("entry_type", ["Swiss-Prot"]);
$query->addFilter("protein_evidence", ["1: Evidence at protein level"]);
$query->addAttribute("accession");
$query->addAttribute("name");
$query->addAttribute("protein_name");
$query->addAttribute("gene_name");
$query->addAttribute("protein_evidence");
$query->addAttribute("entry_type");
my $query_runner = BioMart::QueryRunner->new();
$query_runner->uniqueRowsOnly(1); #to obtain unique rows only
First, get a count of the expected results. This is used below to check that the full result set was returned.
my $count_query_attempt=1;
#Turn on counting
$query->count(1);
my $query_count;
do {
print "Attempting query count, attempt $count_query_attempt\n";
$query_runner->execute($query);
$query_count=$query_runner->getCount();
sleep(1);
$count_query_attempt++;
} until ($query_count);
print "$query_count results expected for query\n";
#turn off counting so that full results can be obtained below
$query->count(0);
Perform main query of interest. Note that results are directed to STDOUT by default. Therefore we will redirect and store output in a temporary file.
my $query_attempt=1;
my $result_count;
my @results;
do {
print "Attempting query, attempt $query_attempt\n";
open (BIOMART_OUT, '>', $tempfile) or die "Can't open $tempfile file for write: $!\n";
$query_runner->execute($query);
#$query_runner->printHeader(\*BIOMART_OUT);
$query_runner->printResults(\*BIOMART_OUT);
#$query_runner->printFooter(\*BIOMART_OUT);
close BIOMART_OUT;
#Read in results and check expected results against count above
open (BIOMART_IN, '<', $tempfile) or die "Can't open $tempfile: $!\n";
@results=<BIOMART_IN>;
close BIOMART_IN;
$result_count=@results;
print "$result_count results returned for query\n\n";
sleep(1);
$query_attempt++;
} until ($result_count==$query_count);
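If you would rather not create a temporary file, Perl can open a filehandle onto a scalar variable. The sketch below shows the idea with a stand-in print_rows function and made-up rows. I have not tested whether printResults behaves the same with an in-memory handle, so treat that as an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for $query_runner->printResults($fh): writes rows to any filehandle.
# The rows are made-up placeholders, not real UniProt entries.
sub print_rows {
    my ($fh) = @_;
    print {$fh} "ACC_A\tID_A\n";
    print {$fh} "ACC_B\tID_B\n";
}

# Open a write handle onto a scalar instead of a file on disk
my $buffer = '';
open(my $out, '>', \$buffer) or die "Can't open in-memory handle: $!\n";
print_rows($out);
close $out;

# Each line of the captured output is one result row
my @results = split /\n/, $buffer;
my $result_count = @results;
print "$result_count results captured\n";
```

The captured rows can then be compared against the expected count in the same way as the rows read back from the temporary file.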
Finally, parse the results and print out in a tab-delimited format
chomp (@results);
my %UniProtDetails;
foreach my $result (@results){
my @data=split("\t", $result);
my $Uniprot_acc=$data[0];
my $Uniprot_id=$data[1];
my $Uniprot_protein_name=$data[2]; unless($Uniprot_protein_name){$Uniprot_protein_name="NA";}
my $Uniprot_gene_name=$data[3]; unless($Uniprot_gene_name){$Uniprot_gene_name="NA";}
my $Uniprot_evidence=$data[4]; unless($Uniprot_evidence){$Uniprot_evidence="NA";}
my $Uniprot_status=$data[5]; unless($Uniprot_status){$Uniprot_status="NA";}
$UniProtDetails{$Uniprot_acc}{Uniprot_id}=$Uniprot_id;
$UniProtDetails{$Uniprot_acc}{Uniprot_protein_name}=$Uniprot_protein_name;
$UniProtDetails{$Uniprot_acc}{Uniprot_gene_name}=$Uniprot_gene_name;
$UniProtDetails{$Uniprot_acc}{Uniprot_evidence}=$Uniprot_evidence;
$UniProtDetails{$Uniprot_acc}{Uniprot_status}=$Uniprot_status;
}
print "Uniprot_acc\tUniprot_id\tUniprot_protein_name\tUniprot_gene_name\n";
foreach my $uniprot_acc (sort keys %UniProtDetails){
print "$uniprot_acc\t$UniProtDetails{$uniprot_acc}{'Uniprot_id'}\t$UniProtDetails{$uniprot_acc}{'Uniprot_protein_name'}\t$UniProtDetails{$uniprot_acc}{'Uniprot_gene_name'}\n";
}
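The column handling above can be written more compactly. This is a sketch only, with hypothetical field names and a made-up input row. It differs from the loop above in two small ways: split is given a limit of -1 so that trailing empty columns are kept, and a value is replaced by "NA" only when it is undefined or empty, so a literal "0" would survive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical column names, in the order the attributes were added to the query
my @fields = qw(acc id protein_name gene_name evidence status);

# Split one tab-delimited row into a hash, filling blank columns with "NA"
sub parse_row {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty columns
    my @data = split /\t/, $line, -1;
    my %row;
    for my $i (0 .. $#fields) {
        my $value = $data[$i];
        $row{ $fields[$i] } = (defined $value && length $value) ? $value : 'NA';
    }
    return \%row;
}

# Made-up row with three empty columns
my $row = parse_row("ACC_A\tID_A\t\tGENE_A\t\t\n");
print join("\t", map { "$_=$row->{$_}" } @fields), "\n";
```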
The fact that you have to login to CVS to access the biomart-perl is painful! Why not just put it on CPAN!
I get the error "Unknown host cvs.sanger.ac.uk." when I try to install from CVS.
If someone is still using this, BioMart perl is on GitHub: https://github.com/biomart/biomart-perl It is still working. In the example code above, somehow > got replaced by &gt;.
Suppose one lacks the permissions to alter the source directly to increase the timeout value? Is there a way to catch the timeout error or to increase the timeout value in some other way?