Question

How To Retrieve A Batch Of Transmembrane Domains From UniProt?

0

11.8 years ago

kevinjspring ▴ 20

I want to download a batch of sequence data from UniProt, but only the regions annotated as transmembrane. On the UniProt website I can open each individual entry and, under 'Sequence Annotation (Features)', retrieve just the part of the sequence I want. This is helpful, but I need to do it for many entries, so I am looking for a batch option. Any tips on how to download a batch of protein sequences that contain only a specific annotated region?

Example:

LAT, 4F2hc, and LAX are all integral, single-pass transmembrane proteins. The full FASTA sequences for these three proteins are:

>sp|O43561|LAT_HUMAN Linker for activation of T-cells family member 1 OS=Homo sapiens GN=LAT PE=1 SV=1
MEEAILVPCVLGLLLLPILAMLMALCVHCHRLPGSYDSTSSDSLYPRGIQFKRPHTVAPW
PPAYPPVTSYPPLSQPDLLPIPRSPQPLGGSHRTPSSRRDSDGANSVASYENEGASGIRG
AQAGWGVWGPSWTRLTPVSLPPEPACEDADEDEDDYHNPGYLVVLPDSTPATSTAAPSAP
ALSTPGIRDSAFSMESIDDYVNVPESGESAEASLDGSREYVNVSQELHPGAAKTEPAALS
SQEAEEVEEEGAPDYENLQELN
>sp|P08195|4F2_HUMAN 4F2 cell-surface antigen heavy chain OS=Homo sapiens GN=SLC3A2 PE=1 SV=3
MELQPPEASIAVVSIPRQLPGSHSEAGVQGLSAGDDSELGSHCVAQTGLELLASGDPLPS
ASQNAEMIETGSDCVTQAGLQLLASSDPPALASKNAEVTGTMSQDTEVDMKEVELNELEP
EKQPMNAASGAAMSLAGAEKNGLVKIKVAEDEAEAAAAAKFTGLSKEELLKVAGSPGWVR
TRWALLLLFWLGWLGMLAGAVVIIVRAPRCRELPAQKWWHTGALYRIGDLQAFQGHGAGN
LAGLKGRLDYLSSLKVKGLVLGPIHKNQKDDVAQTDLLQIDPNFGSKEDFDSLLQSAKKK
SIRVILDLTPNYRGENSWFSTQVDTVATKVKDALEFWLQAGVDGFQVRDIENLKDASSFL
AEWQNITKGFSEDRLLIAGTNSSDLQQILSLLESNKDLLLTSSYLSDSGSTGEHTKSLVT
QYLNATGNRWCSWSLSQARLLTSFLPAQLLRLYQLMLFTLPGTPVFSYGDEIGLDAAALP
GQPMEAPVMLWDESSFPDIPGAVSANMTVKGQSEDPGSLLSLFRRLSDQRSKERSLLHGD
FHAFSAGPGLFSYIRHWDQNERFLVVLNFGDVGLSAGLQASDLPASASLPAKADLLLSTQ
PGREEGSPLELERLKLEPHEGLLLRFPYAA
>sp|Q58CT8|LAX1_BOVIN Lymphocyte transmembrane adapter 1 OS=Bos taurus GN=LAX1 PE=2 SV=1
MDVTTSAWSETTRRISEPSTLQGTLGSLDKAEDHSSSIFSGFAALLAILLVVAVICVLWC
CGKRKKRQVPYLRVTIMPLLTLPRPRQRAKNIYDLLPRRQEELGRHPSRSIRIVSTESLL
SRNSDSPSSEHVPSRAGDALHMHRAHTHAMGYAVGIYDNAMRPQMCGNLAPSPHYVNVRA
SRGSPSTSSEDSRDYVNIPTAKEIAETLASASNPPRNLFILPGTKELAPSEEIDEGCGNA
SDCTSLGSPGTENSDPLSDGEGSSQTSNDYVNMAELDLGTPQGKQLQGMFQCRRDYENVP
PGPSSNKQQEEEVTSSNTDHVEGRTDGPETHTPPAVQSGSFLALKDHVACQSSAHSETGP
WEDAEETSSEDSHDYENVCAAEAGARG

The data I want to retrieve from the UniProt site is:

>sp|O43561|5-27
ILVPCVLGLLLLPILAMLMALCV
>sp|P08195|185-205
LLLLFWLGWLGMLAGAVVIIV
>sp|Q58CT8|38-58
IFSGFAALLAILLVVAVICVL

Each entry corresponds to the single transmembrane domain located in that protein.

The XML data that lists the TM annotation (here for P08195, residues 185-205) is:

<feature type="transmembrane region" description="Helical; Signal-anchor for type II membrane protein;" status="potential"><location><begin position="185"/><end position="205"/></location></feature>

I might be able to parse this and then use the position data to save only the part of the sequence I need. Does Biopython have a parser for this yet?
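As a rough illustration of that idea (a sketch only, using Python's standard library rather than Biopython), the feature element can be parsed with xml.etree.ElementTree and its 1-based, inclusive positions turned into a Python slice. The helper names feature_span and slice_region are made up for this example. Full UniProt XML files normally declare an XML namespace, so the tag lookups below would need to be qualified there; this version only handles the bare fragment shown above.

```python
import xml.etree.ElementTree as ET

# The feature fragment quoted in the question (no namespace on the tags).
FEATURE_XML = (
    '<feature type="transmembrane region" '
    'description="Helical; Signal-anchor for type II membrane protein;" '
    'status="potential"><location><begin position="185"/>'
    '<end position="205"/></location></feature>'
)


def feature_span(feature_xml):
    """Return (begin, end) as 1-based inclusive integers from one <feature> element."""
    feature = ET.fromstring(feature_xml)
    begin = int(feature.find("location/begin").get("position"))
    end = int(feature.find("location/end").get("position"))
    return begin, end


def slice_region(sequence, begin, end):
    """Cut a 1-based inclusive region out of a (0-based) Python string."""
    return sequence[begin - 1:end]
```

With begin=185 and end=205 the slice is sequence[184:205], which is 21 residues long.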

uniprot biopython • 7.3k views

1

You can download the records as UniProt XML, or the old "SwissProt" plain text, and parse them locally to look for transmembrane domains & then extract the sequence for them. At least that's what I would try using Biopython.

Could you give a specific example (e.g. a UniProt protein ID where there are 3 transmembrane domains) and the desired output (e.g. a FASTA file with the region containing the three transmembrane domains only)?

ADD REPLY • link 11.8 years ago by Peter 6.0k

0

I am primarily interested in single-pass transmembrane proteins.

ADD REPLY • link 11.8 years ago by kevinjspring ▴ 20

0

I updated the question with some example data. Does Biopython have a parser for XML data from UniProt?

ADD REPLY • link 11.8 years ago by kevinjspring ▴ 20

1

Yes, "uniprot-xml" and "swiss" (plain text) are both available in Biopython's Bio.SeqIO module; see http://biopython.org/wiki/SeqIO

ADD REPLY • link 11.8 years ago by Peter 6.0k

score 3 · Answer 1 · 2013-10-27

With the plain text SwissProt format, something like this should work using Biopython:

# Hard-coded list; could use os.listdir(...) or glob instead
filenames = ["O43561.txt", "P08195.txt", "Q58CT8.txt"]
input_format = "swiss"
feature_type = "TRANSMEM"
output_filename = "swiss_tm.fasta"

# Real code starts here...
from Bio import SeqIO

with open(output_filename, "w") as output:
    for filename in filenames:
        # Using SeqIO.parse will cope with multi-record files
        for record in SeqIO.parse(filename, input_format):
            for f in record.features:
                if f.type == feature_type:
                    # Biopython locations are 0-based, so add 1 to the start
                    title = "sp|%s|%i-%i" % (record.id, f.location.start + 1, f.location.end)
                    output.write(">%s\n%s\n" % (title, f.extract(record.seq)))

Or, using the UniProt XML format, change these lines:

filenames = ["O43561.xml", "P08195.xml", "Q58CT8.xml"]
input_format = "uniprot-xml"
feature_type = "transmembrane region"
output_filename = "uniprot_tm.fasta"

Either should give this as the FASTA format output:

>sp|O43561|5-27
ILVPCVLGLLLLPILAMLMALCV
>sp|P08195|185-205
LLLLFWLGWLGMLAGAVVIIV
>sp|Q58CT8|38-58
IFSGFAALLAILLVVAVICVL
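Since the question is mainly about single-pass proteins, one possible refinement is to keep a record's transmembrane feature only when it has exactly one. The helper below is a hypothetical sketch, not part of Biopython. It only assumes that the record has a .features list whose items have a .type attribute, as the SeqRecord objects in the loop above do.

```python
def single_pass_regions(record, feature_type="TRANSMEM"):
    """Return the matching features only if the record has exactly one of them.

    Records with zero or several matching features give an empty list,
    so multi-pass proteins are skipped entirely.
    """
    hits = [f for f in record.features if f.type == feature_type]
    return hits if len(hits) == 1 else []
```

In the script above, the inner loop could then iterate over single_pass_regions(record, feature_type) instead of record.features, and the f.type check would no longer be needed.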

Note that I have not covered how to download the SwissProt/UniProt files automatically, which would be a separate question.
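For completeness, here is a minimal sketch of one way the files could be fetched. It assumes that UniProt's REST service serves single entries at https://rest.uniprot.org/uniprotkb/<accession>.txt (and .xml), so please check the current UniProt help pages before relying on that. The function names entry_url and download_entries are invented for this example.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumed base URL of the UniProt REST service; verify before relying on it.
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession, extension="txt"):
    """Build the URL for one entry, e.g. extension "txt" (SwissProt text) or "xml"."""
    return "%s/%s.%s" % (BASE_URL, accession, extension)


def download_entries(accessions, extension="txt", out_dir="."):
    """Save each entry as <accession>.<extension> in out_dir and return the paths."""
    paths = []
    for accession in accessions:
        target = Path(out_dir) / ("%s.%s" % (accession, extension))
        with urlopen(entry_url(accession, extension)) as response:
            target.write_bytes(response.read())
        paths.append(str(target))
    return paths
```

Calling download_entries(["O43561", "P08195", "Q58CT8"]) would then be expected to produce the three .txt files named in the script above.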

score 2 · Answer 2 · 2013-10-24

2

11.8 years ago

Pierre Lindenbaum 166k

See my answer to "How to retrieve human proteins sequence containing a given domain":

java Biostar5862 -t "transmembrane region"
>[11011_ASFP4]|26-46
PFGCNMKGLGVLLGLFSLILA
>[11011_ASFP4]|154-174
LTLKQYCLYFIISIAFAGCFV
>[11011_ASFP4]|183-203
LNTTIKLLTLLSILVYLAQPV
>[141R_IIV6]|49-69
YIIYAIVAAILLLLFWLLYKK
>[14KD_RHOSH]|85-102
LGGFASGALLALALAGIF
>[1A29_HUMAN]|309-332
VGIIAGLVLFGAVFAGAVVAAVRW
>[1B01_PANTR]|306-329
GIVAGLAVLVVTVAVVAVVAAVMC
>[1B54_HUMAN]|309-332
VGIVAGLAVLAVVVIGAVVATVMC
>[1C18_HUMAN]|309-333
VGIVAGLAVLVVLAVLGAVVAVVMC
>[34KD_MYCPA]|42-62
IAVVALGFAAYLLNFGPTFTI

0

I don't have any experience with Java. I will give it a try, but I was hoping there was something I could use with Biopython.

ADD REPLY • link 11.8 years ago by kevinjspring ▴ 20

score 0 · Answer 3 · 2013-10-29

0

11.8 years ago

Elisabeth Gasteiger ★ 2.4k

See also the UniProt FAQ: "How can I download the sequences corresponding to a specified domain or region from a list of UniProt entries?"

0

Please ask this as a new question, not as an attempted answer to the transmembrane parsing question.

ADD REPLY • link 11.8 years ago by Peter 6.0k