Deleted: API Download of UniProt Organism ID and Taxonomic Lineage (ALL) Contains Unprintable Characters
3.3 years ago

What is the output format of the Organism ID and Taxonomic lineage columns when downloading through the API from the command line with the UniProt Perl script? I am having great difficulty parsing this information.

I must be missing something in my Perl script, because the text files I obtain contain non-printable ASCII and non-ASCII characters. I can view each text file in the output directory with cat, but any attempt to concatenate them with cat or cat -v returns only about 60% of the Organism ID and Taxonomic lineage information. Additionally, when I copy the files to a new directory, the copies are empty. I have thousands of these files, so I cannot parse them individually.

I want to cluster proteomes by family, genus, and species so that I can run cd-hit-2d and remove protein redundancy from a large database. When I view the concatenated lineage files made with "cat ./UP.txt > merged.txt" or "cat -v ./UP.txt > printable.txt", about 40% of the Organism IDs and Taxonomic lineages are missing.
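To make the concatenation step easier to diagnose, here is a minimal shell sketch of how the per-proteome files could be checked and merged. The function names count_nonprintable and merge_printable are hypothetical, and the sketch assumes the files are plain text with a .txt extension in a single directory. It only counts and strips bytes outside tab, newline, and printable ASCII, so it would also drop any legitimate non-ASCII characters, and it does not explain where the odd bytes come from.

```shell
# Hypothetical helper: count bytes in a file that are not tab, newline,
# or printable ASCII (octal 040-176).
count_nonprintable() {
  LC_ALL=C tr -d '\011\012\040-\176' < "$1" | wc -c
}

# Hypothetical helper: merge every *.txt file in a directory into one file,
# keeping only tab, newline, and printable ASCII bytes.
# Usage: merge_printable <input_dir> <output_file>
# Write the output file outside the input directory so it is not merged into itself.
merge_printable() {
  local dir="$1" out="$2" f
  : > "$out"                      # start with an empty output file
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue       # the glob matched nothing
    if [ ! -s "$f" ]; then
      echo "warning: $f is empty" >&2
      continue
    fi
    # Delete every byte that is not tab, newline, or printable ASCII.
    LC_ALL=C tr -cd '\011\012\040-\176' < "$f" >> "$out"
  done
}
```

Running count_nonprintable on a few files first shows whether the unexpected bytes are a handful of stray control characters or most of the file, which would point to different causes.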

If this is not the right approach, how can I obtain these lineages in a readable format?

Below is my Perl script:
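For reference, this is roughly the parsing I am aiming for once the download is readable. It is only a sketch under assumptions I have not confirmed: that the tab-separated output has a header row, that the three columns arrive in the order requested in the query (proteome, organism ID, lineage), and that the lineage column is a comma-separated list of taxon names. The function name split_lineage is hypothetical.

```shell
# Hypothetical helper: turn the three-column table into long format,
# one row per taxon: proteome, organism ID, position in lineage, taxon name.
# Usage: split_lineage <tab_file>
split_lineage() {
  awk -F '\t' '
    NR == 1 { next }                    # skip the assumed header row
    NF >= 3 {
      n = split($3, taxa, /, */)        # lineage assumed comma-separated
      for (i = 1; i <= n; i++)
        printf "%s\t%s\t%d\t%s\n", $1, $2, i, taxa[i]
    }
  ' "$1"
}
```

The long format makes it straightforward to group proteomes that share a taxon with sort or awk. Taxon names in such a list carry no rank labels, so deciding which position is the family, genus, or species would still need another source of rank information.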

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Date;

# Taxonomy identifier of top node for query, e.g. 2 for Bacteria, 2157 for Archaea, etc.
# (see https://www.uniprot.org/taxonomy)
my $top_node = $ARGV[0];

my $agent = LWP::UserAgent->new;

# Request the proteome, organism ID, and full lineage columns for the given identifier (tab-separated, limit=1).
my $query_list = "https://www.uniprot.org/uniprot/?query=proteome:$top_node&columns=proteome,organism-id,lineage(ALL)&limit=1&format=tab";
my $response_list = $agent->get($query_list);
die 'Failed, got ' . $response_list->status_line .
  ' for ' . $response_list->request->uri . "\n"
  unless $response_list->is_success;

# For each line of the response, mirror the same tab-separated query result to a local file named after the identifier.
for my $proteome (split(/\n/, $response_list->content)) {
  my $file = $top_node . '.txt';
  my $query_proteome = "https://www.uniprot.org/uniprot/?query=proteome:$top_node&columns=proteome,organism-id,lineage(ALL)&limit=1&format=tab";
  my $response_proteome = $agent->mirror($query_proteome, $file);

  if ($response_proteome->is_success) {
    my $results = $response_proteome->header('X-Total-Results');
    my $release = $response_proteome->header('X-UniProt-Release');
    my $date = sprintf("%4d-%02d-%02d", HTTP::Date::parse_date($response_proteome->header('Last-Modified')));
    print "File $file: downloaded $results entries of UniProt release $release ($date)\n";
  }
  elsif ($response_proteome->code == HTTP::Status::RC_NOT_MODIFIED) {
    print "File $file: up-to-date\n";
  }
  else {
    die 'Failed, got ' . $response_proteome->status_line .
      ' for ' . $response_proteome->request->uri . "\n";
  }
}

CentOS 7 script:

cat > prot.txt
UP000000252    
UP000000265
UP000000368

FILE=prot.txt
while read -r line; do
  perl lineage.pl "$line"
done < "$FILE"
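In case the ID list carries stray whitespace or Windows line endings, a small cleaning step can be put in front of the loop. The sketch below assumes one proteome ID per line with no internal spaces; the function name clean_ids is hypothetical. Note that read already trims leading and trailing spaces and tabs under the default IFS, but it does not remove a carriage return, which would otherwise end up in the argument passed to the Perl script.

```shell
# Hypothetical helper: print one cleaned proteome ID per line from a list file.
# Usage: clean_ids <id_file>
clean_ids() {
  # Remove carriage returns, spaces, and tabs, then drop blank lines.
  tr -d '\r \t' < "$1" | grep -v '^$'
}

# Example use with the loop above (not run here):
#   clean_ids prot.txt | while IFS= read -r id; do
#     perl lineage.pl "$id"
#   done
```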

Any help would be greatly appreciated.

parse unix lineage centos7 uniprot • 322 views