Question

How to download UniProt files using Python and XML

10.7 years ago

Good Gravy ▴ 20

This question follows from another question, "How To Retrive A Batch Of Transmembrane Domains From Uniprot?", which asks how to retrieve transmembrane (TM) domains from UniProt.

The top answer to that question mentions that the UniProt XML format can be used to retrieve a FASTA sequence of each TM region.

filenames = ["O43561.xml", "P08195.xml", "Q58CT8.xml"]
input_format = "uniprot-xml"
feature_type = "transmembrane region"
output_filename = "uniprot_tm.fasta"

How can this XML snippet be used as part of a Python script to download FASTA-formatted files from UniProt? I would like to do this without Biopython modules (How To Retrive A Batch Of Transmembrane Domains From Uniprot?) or Java modules (How To Retrieve Human Proteins Sequence Containing A Given Domain), as those approaches have already been solved.

python uniprot • 7.5k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Good Gravy ▴ 20

Ram · Answer 1 · 2014-12-02

10.7 years ago

Nikhil Chaudhary ▴ 60

Assuming you know a decent bit of Python (I don't!), you can read the filenames line, split it on the quotes (") or commas (,), and get the UniProt IDs (O43561, P08195 and so on). The URL for each ID's FASTA file is of the form "http://www.uniprot.org/uniprot/P08195.fasta?include=yes", where you substitute your own ID. Search Google for a simple Python script that downloads files by URL using urllib. Then feed your UniProt IDs one by one into the downloader script and save the FASTA files as you wish.

Hope that answers your question. This method might be slow and non-standard, but it is just what I would have used.
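A minimal sketch of the approach above, assuming Python 3 and only the standard library. The helper names (`accessions_from_filenames`, `fasta_url`, `download_fasta`) are made up for illustration. The base URL is simply the one quoted in this answer, with the `?include=yes` parameter left off; UniProt may have changed its URL scheme since, so treat it as an assumption and adjust as needed.

```python
import re
import urllib.request

# Base URL as quoted in the answer above (assumption: may be outdated).
BASE_URL = "http://www.uniprot.org/uniprot/"


def accessions_from_filenames(line):
    """Pull UniProt accessions out of a line such as
    filenames = ["O43561.xml", "P08195.xml"]."""
    return re.findall(r'"([^".]+)\.xml"', line)


def fasta_url(accession):
    """Build the FASTA URL for one accession."""
    return BASE_URL + accession + ".fasta"


def download_fasta(accession, out_path=None):
    """Download one FASTA file and return the path it was saved to."""
    out_path = out_path or accession + ".fasta"
    urllib.request.urlretrieve(fasta_url(accession), out_path)
    return out_path


# Example use (needs network access, so it is left commented out):
# line = 'filenames = ["O43561.xml", "P08195.xml", "Q58CT8.xml"]'
# for acc in accessions_from_filenames(line):
#     print("saved", download_fasta(acc))
```

As the follow-up comments point out, this fetches the whole protein sequence, not just the TM regions.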

ADD COMMENT • link updated 10.7 years ago by Ram 45k • written 10.7 years ago by Nikhil Chaudhary ▴ 60

The problem remains of how to get only the TM domain. This method does fetch the FASTA sequences, but of the entire protein.

ADD REPLY • link 10.7 years ago by Good Gravy ▴ 20

The BeautifulSoup library in Python can parse the HTML page. You could get the coordinates of the TM domain from it, then use those coordinates to extract the region from your protein sequence.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Bioinformatics_NewComer ▴ 330

I have a Perl script that can parse the TM domain out, but for technical reasons I want to be able to get the TM domains directly from UniProt. The answers mentioned in the question show this is possible in both Biopython and Java; I am looking for a way to do this in Python alone.

ADD REPLY • link 10.7 years ago by Good Gravy ▴ 20

I tried to find a way to download only the TM region, but I couldn't. In that case, I guess a good way has already been suggested by Bioinformatics_NewComer: parse the HTML page to get the coordinates of the transmembrane region, then cut those regions from the full sequences. That is all I can think of. Do post here if you find a way to do exactly what you want.
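The cutting step could look something like the sketch below. It assumes the TM coordinates are 1-based and inclusive, which is how UniProt reports feature positions, and that the FASTA text has already been fetched. The function names, the toy sequence, and the coordinates are invented for illustration.

```python
def read_fasta_sequence(fasta_text):
    """Return the sequence from single-record FASTA text, header dropped."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def cut_region(sequence, begin, end):
    """Return residues begin..end, both 1-based and inclusive."""
    # Python slices are 0-based and end-exclusive, hence begin - 1.
    return sequence[begin - 1:end]


# Toy record, not a real UniProt entry.
toy_fasta = ">toy_protein\nMKTAYIAKQR\nQISFVKSHFS\n"
seq = read_fasta_sequence(toy_fasta)
print(cut_region(seq, 3, 7))  # prints TAYIA
```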

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Nikhil Chaudhary ▴ 60

Will do, thanks for the help!

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Good Gravy ▴ 20

Ram · Answer 2 · 2014-12-02

10.7 years ago

Bioinformatics_NewComer ▴ 330

I shall try to help you with this. In Python, urllib.urlretrieve works like wget (in Python 3 it is urllib.request.urlretrieve). So if you have UniProt IDs, you can use this. Sorry, cannot get rid of colors. :-(

You can customize the URL with your UniProt IDs, for example http://www.uniprot.org/uniprot/P02185.fasta

This will download the FASTA files.

For parsing XML, Python has dedicated libraries.
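As one possibility, the standard-library `xml.etree.ElementTree` module should be enough to pull TM regions out of a UniProt XML file without Biopython. The sketch below assumes the entry layout I believe UniProt XML uses: an `<entry>` containing `<accession>`, `<feature type="transmembrane region">` elements with a `<location>` holding `<begin position="..."/>` and `<end position="..."/>`, and a `<sequence>` element. The function names and the toy record (including the accession X00000 and the namespace string) are made up, and the code ignores XML namespaces so it does not depend on the exact namespace URI.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def children(elem, name):
    """Direct children of elem whose local tag name equals name."""
    return [c for c in elem if local_name(c.tag) == name]


def tm_regions_as_fasta(xml_text, feature_type="transmembrane region"):
    """Return FASTA text with one record per matching feature."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in children(root, "entry"):
        accession = children(entry, "accession")[0].text
        # The protein sequence is the <sequence> directly under <entry>.
        sequence = "".join(children(entry, "sequence")[-1].text.split())
        for feat in children(entry, "feature"):
            if feat.get("type") != feature_type:
                continue
            loc = children(feat, "location")[0]
            begin, end = children(loc, "begin"), children(loc, "end")
            # Skip features without exact begin/end positions.
            if not begin or not end:
                continue
            b, e = begin[0].get("position"), end[0].get("position")
            if b is None or e is None:
                continue
            b, e = int(b), int(e)
            # Positions are 1-based and inclusive.
            records.append(">%s/%d-%d\n%s" % (accession, b, e, sequence[b - 1:e]))
    return "\n".join(records) + "\n" if records else ""


# Toy document that mimics the assumed layout; not real UniProt data.
toy_xml = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00000</accession>
    <feature type="transmembrane region" description="Helical">
      <location><begin position="3"/><end position="7"/></location>
    </feature>
    <sequence length="20">MKTAYIAKQRQISFVKSHFS</sequence>
  </entry>
</uniprot>"""
print(tm_regions_as_fasta(toy_xml))
```

For downloaded files such as O43561.xml, you could read each file's text, pass it to `tm_regions_as_fasta`, and append the result to an output file like uniprot_tm.fasta.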

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Bioinformatics_NewComer ▴ 330