This question follows from another question - How To Retrive A Batch Of Transmembrane Domains From Uniprot? - that asks how to retrieve transmembrane (TM) domains from UniProt.
The top answer to that question mentions that the UniProt XML format can be used to retrieve a FASTA sequence for each TM region.
filenames = ["O43561.xml", "P08195.xml", "Q58CT8.xml"]
input_format = "uniprot-xml"
feature_type = "transmembrane region"
output_filename = "uniprot_tm.fasta"
How can this XML snippet be used as part of a Python script to download FASTA-formatted files from UniProt? I would like to avoid Biopython modules (How To Retrive A Batch Of Transmembrane Domains From Uniprot?) and Java modules (How To Retrieve Human Proteins Sequence Containing A Given Domain), as those approaches have already been solved.
The problem remains how to get only the TM domain. This method does fetch the FASTA sequences, but for the entire protein.
The Beautiful Soup library in Python can parse an HTML page, so you can use it to get the coordinates of the TM domain. Why not get the coordinates and then cut that region out of your protein sequence?
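The cutting step on its own needs nothing beyond plain Python. Here is a minimal sketch, assuming the coordinates are 1-based and inclusive (as UniProt feature positions are); the function names `cut_region` and `to_fasta` are made up for illustration.

```python
def cut_region(sequence, begin, end):
    """Return residues begin..end of sequence (1-based, inclusive)."""
    if not 1 <= begin <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    # Python slices are 0-based and end-exclusive, hence begin - 1.
    return sequence[begin - 1:end]


def to_fasta(header, sequence, width=60):
    """Format one record as FASTA, wrapping the sequence at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

For example, `cut_region(full_sequence, 5, 27)` would return the 23 residues of a TM helix annotated at positions 5 to 27.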
I have a Perl script that can parse the TM domain out, but for technical reasons I want to be able to get the TM domains directly from UniProt. The answers mentioned in the question show that this is possible in both Biopython and Java; I am looking for a way to do this in Python alone.
I tried to find a way to download only the TM region, but I couldn't. In that case, I think a good approach has already been suggested by Bioinformatics_NewComer: parse the HTML page to get the coordinates of the transmembrane region, then cut those regions from the full sequences. That is all I can think of. Do post here if you find a way to do exactly what you want.
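As a possible alternative to scraping the HTML page, the UniProt XML record mentioned in the question carries the same coordinates, and the standard-library `xml.etree.ElementTree` module can read them without Biopython. What follows is only a rough sketch: the function names, the FASTA header layout and the download URL pattern in `fetch_xml` are my assumptions, not something confirmed in this thread, and features with uncertain positions are simply skipped.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by UniProt XML entries.
NS = {"up": "http://uniprot.org/uniprot"}


def fetch_xml(accession):
    """Download one entry as XML bytes. The URL pattern is an assumption."""
    url = "https://rest.uniprot.org/uniprotkb/" + accession + ".xml"
    with urllib.request.urlopen(url) as response:
        return response.read()


def tm_regions(xml_data, feature_type="transmembrane region"):
    """Return (accession, begin, end, subsequence) for each matching feature."""
    root = ET.fromstring(xml_data)
    records = []
    for entry in root.findall("up:entry", NS):
        accession = entry.findtext("up:accession", namespaces=NS)
        # The full-length sequence is a direct child of <entry>.
        raw = entry.findtext("up:sequence", default="", namespaces=NS)
        sequence = "".join(raw.split())
        for feature in entry.findall("up:feature", NS):
            if feature.get("type") != feature_type:
                continue
            begin = feature.find("up:location/up:begin", NS)
            end = feature.find("up:location/up:end", NS)
            if begin is None or end is None:
                continue
            if begin.get("position") is None or end.get("position") is None:
                continue  # position not given exactly, skip this feature
            start, stop = int(begin.get("position")), int(end.get("position"))
            # UniProt positions are 1-based and inclusive.
            records.append((accession, start, stop, sequence[start - 1:stop]))
    return records


def write_tm_fasta(filenames, output_filename):
    """Write every TM region found in the given XML files to one FASTA file."""
    with open(output_filename, "w") as out:
        for name in filenames:
            with open(name, "rb") as handle:
                for acc, start, stop, seq in tm_regions(handle.read()):
                    out.write(">" + acc + "|" + str(start) + "-" + str(stop) + "\n")
                    out.write(seq + "\n")
```

With the files from the question already saved locally, `write_tm_fasta(filenames, output_filename)` would write one FASTA record per TM region; `tm_regions(fetch_xml("O43561"))` would skip the intermediate files, provided the assumed URL is right.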
Will do, thanks for the help!