What format does the UniProt API return for the Organism ID and Taxonomic lineage columns when downloading from the command line with the UniProt Perl script? I am having great difficulty parsing this information.
I must be missing something in my Perl script, because the text files I obtain contain non-printable ASCII and non-ASCII characters. I can view each text file in the output directory with cat, but any attempt to concatenate them with cat or cat -v returns only about 60% of the Organism ID and Taxonomic lineage information. Additionally, when I copy the files to a new directory, the copies are empty. I have thousands of files and cannot parse them one at a time.
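One way to narrow this down is to measure how many bytes in each downloaded file fall outside printable ASCII, and to check whether a file begins with the gzip magic bytes (1f 8b), which would indicate a compressed response saved under a .txt name. The following is a minimal shell sketch; the function names count_nonprintable and is_gzip are hypothetical helpers, not part of the UniProt script, and whether compression is actually the cause here is not established.

```shell
#!/bin/sh
# Hypothetical helper: count the bytes in a file that are neither printable
# ASCII (0x20-0x7E) nor tab, newline, or carriage return.
count_nonprintable() {
    LC_ALL=C tr -d '\011\012\015\040-\176' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: succeed if the file starts with the gzip magic
# bytes 1f 8b, fail otherwise (including for an empty file).
is_gzip() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}
```

For example, `for f in ./*.txt; do printf '%s\t%s\n' "$f" "$(count_nonprintable "$f")"; done` would print the count for each file, and `is_gzip somefile.txt && echo compressed` would flag a gzip-compressed file.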
I want to cluster proteomes by family, genus, and species so that I can run cd-hit-2d and remove protein redundancy from a large database. When I view the concatenated lineage text files made by "cat ./UP.txt > merged.txt" or "cat -v ./UP.txt > printable.txt", about 40% of the Organism IDs and Taxonomic lineages are missing.
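The merge described above could be sketched in shell as follows, assuming each per-proteome file is a tab-separated table that starts with the same header row. The function name merge_tables is hypothetical, and the UP*.txt glob in the usage line is an assumption about how the files are named.

```shell
#!/bin/sh
# Hypothetical helper: concatenate tab-separated files, printing the
# first line (the header) of the first non-empty file only and skipping
# the first line of every later file.
merge_tables() {
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

Usage would look like `merge_tables ./UP*.txt > merged.txt`. Comparing the number of data rows in merged.txt (`tail -n +2 merged.txt | wc -l`) with the number of input files gives a quick check for missing entries.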
If I am not using the right approach here, how might I obtain these lineages in a readable format?
Below is my Perl script:
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Date;
# Taxonomy identifier of top node for query, e.g. 2 for Bacteria, 2157 for Archaea, etc.
# (see https://www.uniprot.org/taxonomy)
my $top_node = $ARGV[0];
my $agent = LWP::UserAgent->new;
# Get a list of all reference proteomes of organisms below the given taxonomy node.
my $query_list = "https://www.uniprot.org/uniprot/?query=proteome:$top_node&columns=proteome,organism-id,lineage(ALL)&limit=1&format=tab";
my $response_list = $agent->get($query_list);
die 'Failed, got ' . $response_list->status_line .
    ' for ' . $response_list->request->uri . "\n"
    unless $response_list->is_success;
# For each line of the response, mirror the tab-separated proteome, organism ID, and lineage columns to a file.
for my $proteome (split(/\n/, $response_list->content)) {
    my $file = $top_node . '.txt';
    my $query_proteome = "https://www.uniprot.org/uniprot/?query=proteome:$top_node&columns=proteome,organism-id,lineage(ALL)&limit=1&format=tab";
    my $response_proteome = $agent->mirror($query_proteome, $file);
    if ($response_proteome->is_success) {
        my $results = $response_proteome->header('X-Total-Results');
        my $release = $response_proteome->header('X-UniProt-Release');
        my $date = sprintf("%4d-%02d-%02d", HTTP::Date::parse_date($response_proteome->header('Last-Modified')));
        print "File $file: downloaded $results entries of UniProt release $release ($date)\n";
    }
    elsif ($response_proteome->code == HTTP::Status::RC_NOT_MODIFIED) {
        print "File $file: up-to-date\n";
    }
    else {
        die 'Failed, got ' . $response_proteome->status_line .
            ' for ' . $response_proteome->request->uri . "\n";
    }
}
CentOS 7 script:
cat > prot.txt
UP000000252
UP000000265
UP000000368
FILE=prot.txt
while read line; do
    perl lineage.pl $line
done <$FILE
Any help would be greatly appreciated.