I want to download a batch of sequence data from UniProt, but I only want the annotated transmembrane regions. On the UniProt website I can open each individual entry and, under 'Sequence Annotation (Features)', retrieve only the specific part of the sequence I want. This is helpful, but I need to do it for many entries, so I am looking for a batch option. Any tips on how to download a batch of protein sequences that contain only a specific annotated region?
Example:
LAT, 4F2hc, and LAX are all integral, single-pass transmembrane proteins. The full FASTA sequences for these three proteins are:
>sp|O43561|LAT_HUMAN Linker for activation of T-cells family member 1 OS=Homo sapiens GN=LAT PE=1 SV=1
MEEAILVPCVLGLLLLPILAMLMALCVHCHRLPGSYDSTSSDSLYPRGIQFKRPHTVAPW
PPAYPPVTSYPPLSQPDLLPIPRSPQPLGGSHRTPSSRRDSDGANSVASYENEGASGIRG
AQAGWGVWGPSWTRLTPVSLPPEPACEDADEDEDDYHNPGYLVVLPDSTPATSTAAPSAP
ALSTPGIRDSAFSMESIDDYVNVPESGESAEASLDGSREYVNVSQELHPGAAKTEPAALS
SQEAEEVEEEGAPDYENLQELN
>sp|P08195|4F2_HUMAN 4F2 cell-surface antigen heavy chain OS=Homo sapiens GN=SLC3A2 PE=1 SV=3
MELQPPEASIAVVSIPRQLPGSHSEAGVQGLSAGDDSELGSHCVAQTGLELLASGDPLPS
ASQNAEMIETGSDCVTQAGLQLLASSDPPALASKNAEVTGTMSQDTEVDMKEVELNELEP
EKQPMNAASGAAMSLAGAEKNGLVKIKVAEDEAEAAAAAKFTGLSKEELLKVAGSPGWVR
TRWALLLLFWLGWLGMLAGAVVIIVRAPRCRELPAQKWWHTGALYRIGDLQAFQGHGAGN
LAGLKGRLDYLSSLKVKGLVLGPIHKNQKDDVAQTDLLQIDPNFGSKEDFDSLLQSAKKK
SIRVILDLTPNYRGENSWFSTQVDTVATKVKDALEFWLQAGVDGFQVRDIENLKDASSFL
AEWQNITKGFSEDRLLIAGTNSSDLQQILSLLESNKDLLLTSSYLSDSGSTGEHTKSLVT
QYLNATGNRWCSWSLSQARLLTSFLPAQLLRLYQLMLFTLPGTPVFSYGDEIGLDAAALP
GQPMEAPVMLWDESSFPDIPGAVSANMTVKGQSEDPGSLLSLFRRLSDQRSKERSLLHGD
FHAFSAGPGLFSYIRHWDQNERFLVVLNFGDVGLSAGLQASDLPASASLPAKADLLLSTQ
PGREEGSPLELERLKLEPHEGLLLRFPYAA
>sp|Q58CT8|LAX1_BOVIN Lymphocyte transmembrane adapter 1 OS=Bos taurus GN=LAX1 PE=2 SV=1
MDVTTSAWSETTRRISEPSTLQGTLGSLDKAEDHSSSIFSGFAALLAILLVVAVICVLWC
CGKRKKRQVPYLRVTIMPLLTLPRPRQRAKNIYDLLPRRQEELGRHPSRSIRIVSTESLL
SRNSDSPSSEHVPSRAGDALHMHRAHTHAMGYAVGIYDNAMRPQMCGNLAPSPHYVNVRA
SRGSPSTSSEDSRDYVNIPTAKEIAETLASASNPPRNLFILPGTKELAPSEEIDEGCGNA
SDCTSLGSPGTENSDPLSDGEGSSQTSNDYVNMAELDLGTPQGKQLQGMFQCRRDYENVP
PGPSSNKQQEEEVTSSNTDHVEGRTDGPETHTPPAVQSGSFLALKDHVACQSSAHSETGP
WEDAEETSSEDSHDYENVCAAEAGARG
The data I want to be able to retrieve from the UniProt site is:
>sp|O43561|5-27
ILVPCVLGLLLLPILAMLMALCV
>sp|P08195|185-205
LLLLFWLGWLGMLAGAVVIIV
>sp|Q58CT8|38-58
IFSGFAALLAILLVVAVICVL
Each of these corresponds to the single transmembrane domain in that protein.
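To make the coordinate convention explicit, here is a minimal sketch of how the output above relates to the full sequences. It assumes the feature positions are 1-based and inclusive, which is consistent with the examples; `region_fasta` is a hypothetical helper name, and the header layout simply copies the one shown above.

```python
def region_fasta(accession, sequence, begin, end):
    """Format residues begin..end (1-based, inclusive) as a FASTA entry."""
    if not 1 <= begin <= end <= len(sequence):
        raise ValueError(f"range {begin}-{end} does not fit the sequence")
    # Python slices are 0-based and end-exclusive, hence begin - 1.
    segment = sequence[begin - 1:end]
    return f">sp|{accession}|{begin}-{end}\n{segment}\n"


# First 60 residues of two of the example entries, copied from the FASTA above.
lat_start = "MEEAILVPCVLGLLLLPILAMLMALCVHCHRLPGSYDSTSSDSLYPRGIQFKRPHTVAPW"
lax_start = "MDVTTSAWSETTRRISEPSTLQGTLGSLDKAEDHSSSIFSGFAALLAILLVVAVICVLWC"

print(region_fasta("O43561", lat_start, 5, 27), end="")
print(region_fasta("Q58CT8", lax_start, 38, 58), end="")
```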
The XML data that lists the TM annotation is:
<feature type="transmembrane region" description="Helical; Signal-anchor for type II membrane protein;" status="potential"><location><begin position="185"/><end position="205"/></location></feature>
I might be able to parse this and then use the position data to save only the part of the sequence I need. Does Biopython have a parser for this yet?
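As a rough sketch of that parsing idea, the positions can be pulled out of a feature element like the one above with Python's standard library alone. The function names are hypothetical. Full UniProt XML files declare an XML namespace, so this sketch compares local tag names rather than assuming a particular namespace URI, and it skips any feature whose begin or end has no `position` attribute.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def transmembrane_ranges(xml_text):
    """Return (begin, end) pairs, 1-based inclusive, one per TM feature."""
    root = ET.fromstring(xml_text)
    ranges = []
    for elem in root.iter():  # iter() also visits the root element itself
        if local_name(elem.tag) != "feature":
            continue
        if elem.get("type") != "transmembrane region":
            continue
        positions = {}
        for child in elem.iter():
            name = local_name(child.tag)
            if name in ("begin", "end") and child.get("position"):
                positions[name] = int(child.get("position"))
        if "begin" in positions and "end" in positions:
            ranges.append((positions["begin"], positions["end"]))
    return ranges


# The feature element quoted above, split over several string literals.
fragment = (
    '<feature type="transmembrane region" '
    'description="Helical; Signal-anchor for type II membrane protein;" '
    'status="potential"><location><begin position="185"/>'
    '<end position="205"/></location></feature>'
)
print(transmembrane_ranges(fragment))
```

The resulting pairs can then be used to slice the entry's sequence.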
You can download the records as UniProt XML, or as the old "SwissProt" plain text, parse them locally to find the transmembrane domains, and then extract the sequence for each one. At least that's what I would try using Biopython.
Could you give a specific example (e.g. a UniProt protein ID where there are 3 transmembrane domains) and the desired output (e.g. a FASTA file with the region containing the three transmembrane domains only)?
I am primarily interested in single-pass transmembrane proteins.
I have updated the question with some example data. Does Biopython have a parser for XML data from UniProt?
Yes, "uniprot-xml" and "swiss" (plain text) are available in Biopython's Bio.SeqIO module, see http://biopython.org/wiki/SeqIO
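A minimal sketch of how this could look with Bio.SeqIO follows. It rests on some assumptions worth checking against your Biopython version: that transmembrane features are reported with the type "transmembrane region" by the "uniprot-xml" parser and "TRANSMEM" by the "swiss" parser, that `record.id` holds the primary accession, and that the feature positions are exact. The function names and file names are hypothetical.

```python
# Feature type names assumed for the "uniprot-xml" and "swiss" parsers.
TM_TYPES = {"transmembrane region", "TRANSMEM"}


def tm_segments(record):
    """Yield (accession, begin, end, sequence) for each TM feature.

    begin and end are 1-based inclusive, as on the UniProt website.
    """
    for feature in record.features:
        if feature.type not in TM_TYPES:
            continue
        # Biopython locations are 0-based and end-exclusive.
        start = int(feature.location.start)
        end = int(feature.location.end)
        yield record.id, start + 1, end, str(record.seq[start:end])


def write_tm_fasta(in_path, out_path, fmt="uniprot-xml"):
    """Write every transmembrane segment found in in_path as FASTA."""
    # Imported here so that tm_segments can be reused without Biopython.
    from Bio import SeqIO

    with open(out_path, "w") as out:
        for record in SeqIO.parse(in_path, fmt):
            for accession, begin, end, segment in tm_segments(record):
                out.write(f">sp|{accession}|{begin}-{end}\n{segment}\n")


# Example call with hypothetical file names:
# write_tm_fasta("uniprot_entries.xml", "tm_regions.fasta")
```

For single-pass proteins this gives one FASTA entry per record; multi-pass proteins would get one entry per annotated segment.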