Does anyone here regularly access uniprot info using python? If so how?
I tried downloading
https://github.com/boscoh/uniprot through github but was unable to figure out the installation. What does everyone here use?
It amazes me that this simple thing is not answered correctly in any of the Biostars questions related to this issue ("how can I get UniProt sequences via Python"). It took me some time to find out how to put multiple ids in the query, so hopefully this is useful for future visitors.
from io import StringIO
from typing import List
import urllib.parse
import urllib.request

import pandas as pd


def get_uniprot_sequences(uniprot_ids: List[str]) -> pd.DataFrame:
    """Retrieve UniProt sequences for a list of UniProt identifiers.

    For large lists it is recommended to perform batch retrieval.

    Documentation on which columns are available:
    https://www.uniprot.org/help/uniprotkb%5Fcolumn%5Fnames

    This script is based on https://www.biostars.org/p/67822/

    Parameters:
        uniprot_ids: list of UniProt identifiers

    Returns:
        pd.DataFrame with Entry, Sequence and Query columns
    """
    url = 'https://www.uniprot.org/uploadlists/'  # webserver for batch retrieval of UniProt data
    params = {
        'from': 'ACC',
        'to': 'ACC',
        'format': 'tab',
        'query': ' '.join(uniprot_ids),
        'columns': 'id,sequence',
    }
    data = urllib.parse.urlencode(params).encode('ascii')
    request = urllib.request.Request(url, data)
    with urllib.request.urlopen(request) as response:
        res = response.read()
    df_fasta = pd.read_csv(StringIO(res.decode('utf-8')), sep='\t')
    df_fasta.columns = ['Entry', 'Sequence', 'Query']
    # Two different entries may be returned for a single query id; split those rows.
    return df_fasta.assign(Query=df_fasta['Query'].str.split(',')).explode('Query')
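The final split step can be illustrated offline with a toy frame (the identifiers and sequence below are made up for the demonstration):

```python
import pandas as pd

# Toy result table: one row maps two comma-joined query ids to the same entry.
df = pd.DataFrame({
    "Entry": ["P12345"],
    "Sequence": ["MKV"],
    "Query": ["P12345,A0A024R1R8"],
})

# Split the comma-joined query ids into separate rows.
out = df.assign(Query=df["Query"].str.split(",")).explode("Query")
print(out["Query"].tolist())  # ['P12345', 'A0A024R1R8']
```

Each resulting row keeps the same Entry and Sequence, so downstream code can treat every query id independently.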
Hey Vasam Manjveekar, I guess UniProt changed their API; I haven't used the snippet for a while. If you go to uniprot.org manually, search for "human", then click Share (a tab opens on the left of the main window) -> Generate URL for API, you will see the new style, which is presumably also documented somewhere for ID mapping.
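For the new-style REST API, a query URL can be built along these lines. This is only a sketch: the endpoint and the `fields` parameter names are taken from my reading of the current REST documentation and may need adjusting if the API changes again.

```python
import urllib.parse

# New-style UniProt REST endpoint (assumption: see lead-in above).
BASE = "https://rest.uniprot.org/uniprotkb/stream"

def build_sequence_url(uniprot_ids):
    """Build a batch query URL requesting accession and sequence as TSV."""
    query = " OR ".join("accession:" + uid for uid in uniprot_ids)
    params = {"query": query, "format": "tsv", "fields": "accession,sequence"}
    return BASE + "?" + urllib.parse.urlencode(params)

url = build_sequence_url(["P12345", "Q7Z7W5"])
# The URL can then be fetched with urllib.request.urlopen(url).
```

The response should parse with the same `pd.read_csv(..., sep="\t")` call as in the snippet above.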
I do not regularly access Uniprot from Python, but just today solved a matching Rosalind task. My solution uses the urllib library to download the data:
import urllib.request

code = "Q7Z7W5"
url = "https://www.uniprot.org/uniprot/" + code + ".txt"
data = urllib.request.urlopen(url).read().decode("utf-8")
And then uses split()
to process the file line by line. Each line has some structure and starts with a two-character code, like "DR". The content of the lines is reasonably well structured and, as the Rosalind task requires, allows you to extract GO ontology term annotations.
This script will work correctly for an exercise, but if you want to download many entries you should consider adding a time.sleep() call after each request, in order not to overload the UniProt server. You should also consider adding your email address to the User-Agent header, as requested in UniProt's FAQ (http://www.uniprot.org/faq/28).
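Both suggestions together might look something like this sketch (the contact address and delay are placeholders you should change):

```python
import time
import urllib.request

EMAIL = "you@example.org"  # placeholder: put your real address here, per UniProt's FAQ

def fetch_entries(codes, delay=1.0):
    """Download several entries politely: identify yourself via the
    User-Agent header and pause between requests."""
    results = {}
    for code in codes:
        url = "https://www.uniprot.org/uniprot/" + code + ".txt"
        request = urllib.request.Request(
            url, headers={"User-Agent": "Python urllib; " + EMAIL}
        )
        with urllib.request.urlopen(request) as response:
            results[code] = response.read().decode("utf-8")
        time.sleep(delay)  # don't hammer the server
    return results
```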
Which operating system are you using? On Ubuntu, try sudo apt-get install python-setuptools , and then sudo easy_install uniprot
Yes, the instructions assume that you have the pip package installer.
I have windows 7 with pip installed. When trying to install I typed 'pip install uniprot'
The installation through pip or easy_install works fine for me, in Ubuntu. In any case you can probably simply download the uniprot.py script from there, and import it from the same folder.