How To Find Sequences For Protein Names (A Challenge)
8
5
Entering edit mode
14.1 years ago

I came across an interesting problem yesterday, and, while I solved it in my own way, I thought it would make an interesting challenge for you good folks.

In a situation where you have only the name for a protein in UniProt (rather than the accession), it is perfectly possible to visit a URL such as: http://www.uniprot.org/uniprot/OPSD_HUMAN and the UniProt website will redirect you to http://www.uniprot.org/uniprot/P08100, which is, obviously, the UniProt entry for the protein concerned.

If you have a list of such names, and you want to get FASTA sequences for them, to me the obvious thing to do would be to go to http://www.uniprot.org/uniprot/OPSD_HUMAN.fasta, which in light of the previous behaviour, I would expect to redirect to http://www.uniprot.org/uniprot/P08100.fasta. Unfortunately this is not the case, the .fasta URL also redirects to the UniProt entry for a protein, not to the sequence.

I would imagine it is a one-line fix to "correct" this behaviour, and I was wondering if the Biostar-ers could come up with a solution of a similar length (rather than my crufty Python solution).

So, to clarify - a way to retrieve FASTA sequences for a list of UniProt protein names. An example list can be downloaded from here.

Some rules - there are no rules - any language, any library, any tool. I'm very interested in how broad a range of solutions we can produce. Obviously there is no correct answer, so the highest scoring response by the end of the week (i.e. 5pm Friday UK time, or as close to) gets the accepted answer bonus.

sequence uniprot conversion • 9.9k views
ADD COMMENT
8
Entering edit mode
14.1 years ago
Mary 11k

Ok, I'll do mine in English:

  1. go to UniProt.org.
  2. click tab "retrieve"
  3. Paste list into text box. Click Retrieve button.
  4. On results page, click FASTA download [ Download (30 KB*) | Open ] (Or you could click open just to have a look).

I'm done.

ADD COMMENT
1
Entering edit mode

Your "using an already available webform" solution of course wins against my script ;-)

ADD REPLY
0
Entering edit mode

Well, I can get mine down to 95 characters: Go to uniprot.org, click Retrieve tab. Paste list in box, click Retrieve button. Choose output. [?] And, may I add, that mine outputs as web interface, FASTA, GFF, Flat text, xml, rdf/xml, or a list of the corresponding IDs.

ADD REPLY
0
Entering edit mode

Heh. Yeah, but that wasn't ruled out! There were no rules--it said any tool! You didn't have to write the platform you are using, right? Kidding. I'm just trying to be the voice of a non-coder here. If this site is to be broadly useful to people, scripts may not always be the only answer.

ADD REPLY
0
Entering edit mode

Scripts are the only answer. But in your case, one of the uniprot devs did the writing ;-)

ADD REPLY
0
Entering edit mode

yep, it's always a code or script that does the work, whether already done in a web-based tool or newly created. But, no use in reinventing the wheel (unless it's a better wheel of course) :D.

ADD REPLY
0
Entering edit mode

ROFL. Yeah, no one would ever want to reuse anyone else's existing code...wait a minute.... (PS: I can't believe that my answer has a positive value score. I expected to get downrated for this answer!)

ADD REPLY
5
Entering edit mode
14.1 years ago

Quite short (128 characters) and simple in bash:

for N in $NAMES; do curl -D o http://www.uniprot.org/uniprot/$N && \
wget `egrep -o http[\.\:\/a-zA-Z0-9]+ o`.fasta && rm o; done

$NAMES needs to be an array of names, can be replaced with cat filename (in backticks) if names are there.

edit: the shortest way I can come up with is 109 characters:

U="www.uniprot.org/uniprot/";for N in $NAMES;do curl -D o $U$N;\
wget `egrep -o $U[0-9A-Z]+ o`.fasta;rm o;done
ADD COMMENT
4
Entering edit mode
14.1 years ago
Neilfws 49k

A couple of Bioperl and Bioruby solutions.

1. Bioperl

If you install the utility scripts, there's one named bp_fetch which you can run as:

bp_fetch net::genpept:OPSD_HUMAN
# result
>OPSD_HUMAN RecName: Full=Rhodopsin; AltName: Full=Opsin-2.
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA

Although obviously that fetches from GenPept, rather than UniProt.

The more detailed Bioperl solution:

use Bio::DB::SwissProt;    
use Bio::SeqIO;
use strict;
my $sp = Bio::DB::SwissProt->new('-servertype'   => 'expasy',
                                 '-hostlocation' => 'australia');

my $seq = $sp->get_Seq_by_id('OPSD_HUMAN');
my $outfile = Bio::SeqIO->new(-file => ">OPSD_HUMAN.fa", -format => 'fasta');
$outfile->write_seq($seq);

2. Bioruby

Bioruby also comes with a utility called br_biofetch.rb. It runs like this:

br_biofetch.rb -s http://www.ebi.ac.uk/cgi-bin/dbfetch swissprot OPSD_HUMAN raw fasta
# result
>sp|P08100|OPSD_HUMAN Rhodopsin;
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA

As before, you can also write scripts around the module, Bio::Fetch, as described here. Writing the loops to fetch multiple queries should be easy enough.

ADD COMMENT
3
Entering edit mode
14.1 years ago
Heikki ▴ 360

It is good to remember that UniProt maintains up-to-date mappings between not only between AC and ID but also a large number of of other databases. See the README file and rest of the contents of this directory:

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping

These mappings are easy to tie to any script fetching the entries.

ADD COMMENT
0
Entering edit mode

This is how I did it :)

ADD REPLY
0
Entering edit mode

+1 for ID mapping

ADD REPLY
3
Entering edit mode
14.1 years ago

Using awk and mysql (uniprot db at UCSC):

curl -s "http://bioinf1.ncl.ac.uk/biostar/protein.list" |\
egrep -v '^$' |\
awk 'BEGIN { printf("select concat(\">\",P.acc,\":\",D.val,char(10),P.val) from protein as P, displayId as D where D.acc=P.acc and D.val in (\"_IGNORE_\"\n",$1);} { printf(",\"%s\"\n",$1); } END { printf(");\n");}' |\
mysql  -N -h  genome-mysql.cse.ucsc.edu -A -u genome -D uniProt -r

>A4L9L3:A4L9L3_CAVPO
MNGTEGENFYIPFSNATGVVRSPFEYPQYYLAEPWQFSILAAYMFMLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVANLFMVLGGFTTTLYTSMNGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVVFTWIMALACAAPPLVGWSRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAAYIFTHQGSNFGPIFMTVPAFFAKSSSIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASTTVSKTETSQVAPA
>P02699:OPSD_BOVIN
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA
>Q6W3E1:OPSD_CALPD
MNGTEGPNFYVPFSNKTGVVRSPFEEPQYYLAEPWQFSCLAAYMFMLIVLGFPINFLTLYVTIQHKKLRTPLNYILLNLAIADLFMVFGGFTTTLYTSLHGYFVFGPTGCDLEGFFATLGGEIALWSLVVLAIERYIVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMVVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPILMTLPAFFAKTSAVYNPVIYIMLNKQFRTCMLTTLCCGKIPLGDDEASATASKTETSQVAPA
(...)
ADD COMMENT
3
Entering edit mode
14.1 years ago

My second solution: using TOGOWS/REST

Nucleic Acids Res. 2010 Jul 1;38 Suppl:W706-11. Epub 2010 May 14. TogoWS: integrated SOAP and REST APIs for interoperable bioinformatics Web services. Katayama T, Nakao M, Takagi T. http://www.ncbi.nlm.nih.gov/pubmed/20472643

for ACC in ` curl -s "http://bioinf1.ncl.ac.uk/biostar/protein.list" | egrep -v '^$' `
do
   curl "http://togows.dbcls.jp/entry/uniprot/${ACC}.fasta"
done

>sp|P02699|OPSD_BOVIN Rhodopsin;
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIP
EGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAV
YNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA
>sp|Q6W3E1|OPSD_CALPD Rhodopsin;
MNGTEGPNFYVPFSNKTGVVRSPFEEPQYYLAEPWQFSCLAAYMFMLIVLGFPINFLTLY
VTIQHKKLRTPLNYILLNLAIADLFMVFGGFTTTLYTSLHGYFVFGPTGCDLEGFFATLG
GEIALWSLVVLAIERYIVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIP
EGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMVVIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPILMTLPAFFAKTSAV
YNPVIYIMLNKQFRTCMLTTLCCGKIPLGDDEASATASKTETSQVAPA
>sp|P32308|OPSD_CANFA Rhodopsin;
MNGTEGPNFYVPFSNKTGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNVEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGMQCSCGIDYYTLKPEINNESFVIYMFVVHFAIPMIVIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSDFGPIFMTLPAFFAKSSSI
YNPVIYIMMNKQFRNCMITTLCCGKNPLGDDEASASASKTETSQVAPA
ADD COMMENT
2
Entering edit mode
14.1 years ago

You can get the mappings directly using the UniProt REST interface. This is a bit more general than a URL modification since you can make use of the mapping API to go from any input identifier to the UniProt reference. Here's a Clojure script that fetches matching accessions for any number of identifiers on the command line, and then writes them out as FASTA files:

(comment "
  Retrieve UniProt FASTA files for provided input identifiers using REST API.
    http://www.uniprot.org/faq/28
  Usage:
    cljr run fetch_uniprot_fasta.clj [Any number of Uniprot IDs]
")

(ns biostar.uniprot
  (:import [java.net URLEncoder])
  (:use [clojure.contrib.duck-streams :only (with-out-writer)])
  (:require [clojure.contrib.http.agent :as http]
            [clojure.contrib.str-utils2 :as str]))

(defn- url-encode [v] (URLEncoder/encode (str v) "utf-8"))

(defn id-to-acc [uid]
  "Map a named identifier to accessions using UniProt REST API."
  (let [map-url "http://www.uniprot.org/mapping/"
        params {:from "ID" :to "ACC" :format "tab" :query uid}
        param-str (str/join "&" (for [[k v] params] 
                                  (str (name k) "=" (url-encode v))))
        cur-url (format "%s?%s" map-url param-str)
        result (http/string (http/http-agent cur-url))]
    (for [line (rest (str/split-lines result))]
      (second (str/split line #"\t")))))

(defn id-to-fasta [uid]
  "Get FASTA from UniProt REST API, removing duplicate IDs with same record."
  (let [fasta-url "http://www.uniprot.org/uniprot/"
        get-fasta (fn [acc]
                    (let [acc-url (format "%s%s.fasta" fasta-url acc)]
                      (http/string (http/http-agent acc-url))))]
    (distinct (for [acc (id-to-acc uid)] (get-fasta acc)))))

(defn write-as-fasta [uid]
  "Write out FASTA records for a UniProt identifier."
  (let [out-file (format "%s.fa" uid)]
    (with-out-writer out-file
      (doseq [fa (id-to-fasta uid)]
        (println fa)))))

(when *command-line-args*
  (doseq [uid *command-line-args*]
    (write-as-fasta uid))
  (shutdown-agents))

Use the cljr Clojure REPL to run it:

cljr run fetch_uniprot_fasta.clj OPSD_HUMAN TEST_HUMAN
ADD COMMENT

Login before adding your answer.

Traffic: 1987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6