Deleted: API Download of UniProt Organism ID and Taxonomic lineage (ALL) Contains Unprintable Characters

3.3 years ago

katieostrouchov ▴ 30

What format do the Organism ID and Taxonomic lineage (ALL) columns come in when they are downloaded through the UniProt API from the command line with the UniProt Perl script? I am having great difficulty parsing this information.

I must be missing something in my Perl script, because the text files it produces contain non-printable ASCII and non-ASCII characters. I can view each text file in the output directory with cat, but any attempt to concatenate them with cat or cat -v returns only about 60% of the Organism ID and Taxonomic lineage information. Additionally, when I copy the files to a new directory, the copies are empty. I have thousands of these files, so I cannot parse them one by one.
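To see how much of each file is unprintable, a byte count is a reasonable first check. The following is only a sketch: count_nonprintable is a hypothetical helper name, and it treats every byte outside printable ASCII and whitespace as unprintable, so legitimate UTF-8 characters are counted too.

```shell
# Hypothetical helper: print the number of bytes in a file that are neither
# printable ASCII nor whitespace. In the C locale every byte >= 0x80 counts as
# non-printable, so multi-byte UTF-8 characters are counted as well.
count_nonprintable() {
    LC_ALL=C tr -d '[:print:][:space:]' < "$1" | wc -c | tr -d ' '
}
```

Called on a clean ASCII tab-separated file, the helper prints 0. A large count may suggest that the file holds compressed or binary data rather than plain text.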

I want to cluster proteomes by family, genus, and species so that I can run cd-hit-2d and remove protein redundancy from a large database. When I view the concatenated lineage files made with "cat ./UP.txt > merged.txt" or "cat -v ./UP.txt > printable.txt", about 40% of the Organism IDs and Taxonomic lineages are missing.
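One thing worth ruling out is the merge step itself: plain cat joins the last line of a file that lacks a trailing newline onto the first line of the next file, and it repeats any header row once per file. A minimal sketch of a more careful merge follows. The name merge_tab_files is hypothetical, and the sketch assumes that every input file starts with exactly one header row, which may not hold for these downloads.

```shell
# Hypothetical helper: merge tab-separated files into one output file.
# Usage: merge_tab_files OUTPUT INPUT...
# Assumption: every input starts with one header row; only the first is kept.
merge_tab_files() {
    out=$1
    shift
    # FNR restarts at 1 for each input file, NR does not, so this skips the
    # first line of every file except the first. awk also ends each printed
    # line with a newline, even if the input file's last line had none.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$out"
}
```

The output file should live outside the set of inputs (for example in another directory), otherwise a shell glob could feed the half-written output back into the merge. Comparing the line count of the merged file with the sum over the inputs then shows whether any rows were lost.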

If I am not using the right approach here, how might I obtain these lineages in a readable format?

Below is my Perl script:

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Date;

# Taxonomy identifier of top node for query, e.g. 2 for Bacteria, 2157 for Archaea, etc.
# (see https://www.uniprot.org/taxonomy)
my $top_node = $ARGV[0];

my $agent = LWP::UserAgent->new;

# Get a list of all reference proteomes of organisms below the given taxonomy node.
my $query_list = "https://www.uniprot.org/uniprot/?query=proteome:$top_node&columns=proteome,organism-id,lineage(ALL)&limit=1&format=tab";
my $response_list = $agent->get($query_list);
die 'Failed, got ' . $response_list->status_line .
  ' for ' . $response_list->request->uri . "\n"
  unless $response_list->is_success;

# For each line of the list response, mirror the proteome, organism ID, and
# lineage columns (tab-separated) to the file "$top_node.txt".
for my $proteome (split(/\n/, $response_list->content)) {
  my $file = $top_node . '.txt';
  my $query_proteome = "https://www.uniprot.org/uniprot/?query=proteome:$top_node&columns=proteome,organism-id,lineage(ALL)&limit=1&format=tab";
  my $response_proteome = $agent->mirror($query_proteome, $file);

  if ($response_proteome->is_success) {
    my $results = $response_proteome->header('X-Total-Results');
    my $release = $response_proteome->header('X-UniProt-Release');
    my $date = sprintf("%4d-%02d-%02d", HTTP::Date::parse_date($response_proteome->header('Last-Modified')));
    print "File $file: downloaded $results entries of UniProt release $release ($date)\n";
  }
  elsif ($response_proteome->code == HTTP::Status::RC_NOT_MODIFIED) {
    print "File $file: up-to-date\n";
  }
  else {
    die 'Failed, got ' . $response_proteome->status_line .
      ' for ' . $response_proteome->request->uri . "\n";
  }
}

CentOS 7 script:

cat > prot.txt
UP000000252    
UP000000265
UP000000368

FILE=prot.txt
while read -r line; do
  perl lineage.pl "$line"
done < "$FILE"
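If the ID list contains stray trailing spaces, blank lines, or Windows line endings, the IDs passed to the Perl script may not be what they appear to be. As a minimal sketch (clean_ids is a hypothetical helper name), the list could be normalized before the loop runs:

```shell
# Hypothetical helper: print one clean ID per line from a list file.
# Drops carriage returns, surrounding whitespace, and blank lines.
clean_ids() {
    tr -d '\r' < "$1" | awk 'NF { print $1 }'
}

# Possible use with the loop above (left commented out here):
# clean_ids prot.txt | while read -r id; do perl lineage.pl "$id"; done
```

Only the first whitespace-separated field of each line is kept, which suits a list with one proteome ID per line.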

Any help would be greatly appreciated.

parse unix lineage centos7 uniprot • 322 views
