How To Retrieve A Batch Of Transmembrane Domains From UniProt?
11.1 years ago
kevinjspring ▴ 20

I want to download a batch of sequence data from UniProt, but only the regions annotated as transmembrane. On the UniProt website I can open each individual entry and, under 'Sequence Annotation (Features)', retrieve just the part of the sequence I want. That works, but I need to do this for many entries, so I am looking for a batch option. Any tips on how to download a batch of protein sequences that contain only a specific annotated region?

Example:

LAT, 4F2hc, and LAX are all integral, single-pass transmembrane proteins. The full FASTA sequences for these three proteins are:

>sp|O43561|LAT_HUMAN Linker for activation of T-cells family member 1 OS=Homo sapiens GN=LAT PE=1 SV=1
MEEAILVPCVLGLLLLPILAMLMALCVHCHRLPGSYDSTSSDSLYPRGIQFKRPHTVAPW
PPAYPPVTSYPPLSQPDLLPIPRSPQPLGGSHRTPSSRRDSDGANSVASYENEGASGIRG
AQAGWGVWGPSWTRLTPVSLPPEPACEDADEDEDDYHNPGYLVVLPDSTPATSTAAPSAP
ALSTPGIRDSAFSMESIDDYVNVPESGESAEASLDGSREYVNVSQELHPGAAKTEPAALS
SQEAEEVEEEGAPDYENLQELN
>sp|P08195|4F2_HUMAN 4F2 cell-surface antigen heavy chain OS=Homo sapiens GN=SLC3A2 PE=1 SV=3
MELQPPEASIAVVSIPRQLPGSHSEAGVQGLSAGDDSELGSHCVAQTGLELLASGDPLPS
ASQNAEMIETGSDCVTQAGLQLLASSDPPALASKNAEVTGTMSQDTEVDMKEVELNELEP
EKQPMNAASGAAMSLAGAEKNGLVKIKVAEDEAEAAAAAKFTGLSKEELLKVAGSPGWVR
TRWALLLLFWLGWLGMLAGAVVIIVRAPRCRELPAQKWWHTGALYRIGDLQAFQGHGAGN
LAGLKGRLDYLSSLKVKGLVLGPIHKNQKDDVAQTDLLQIDPNFGSKEDFDSLLQSAKKK
SIRVILDLTPNYRGENSWFSTQVDTVATKVKDALEFWLQAGVDGFQVRDIENLKDASSFL
AEWQNITKGFSEDRLLIAGTNSSDLQQILSLLESNKDLLLTSSYLSDSGSTGEHTKSLVT
QYLNATGNRWCSWSLSQARLLTSFLPAQLLRLYQLMLFTLPGTPVFSYGDEIGLDAAALP
GQPMEAPVMLWDESSFPDIPGAVSANMTVKGQSEDPGSLLSLFRRLSDQRSKERSLLHGD
FHAFSAGPGLFSYIRHWDQNERFLVVLNFGDVGLSAGLQASDLPASASLPAKADLLLSTQ
PGREEGSPLELERLKLEPHEGLLLRFPYAA
>sp|Q58CT8|LAX1_BOVIN Lymphocyte transmembrane adapter 1 OS=Bos taurus GN=LAX1 PE=2 SV=1
MDVTTSAWSETTRRISEPSTLQGTLGSLDKAEDHSSSIFSGFAALLAILLVVAVICVLWC
CGKRKKRQVPYLRVTIMPLLTLPRPRQRAKNIYDLLPRRQEELGRHPSRSIRIVSTESLL
SRNSDSPSSEHVPSRAGDALHMHRAHTHAMGYAVGIYDNAMRPQMCGNLAPSPHYVNVRA
SRGSPSTSSEDSRDYVNIPTAKEIAETLASASNPPRNLFILPGTKELAPSEEIDEGCGNA
SDCTSLGSPGTENSDPLSDGEGSSQTSNDYVNMAELDLGTPQGKQLQGMFQCRRDYENVP
PGPSSNKQQEEEVTSSNTDHVEGRTDGPETHTPPAVQSGSFLALKDHVACQSSAHSETGP
WEDAEETSSEDSHDYENVCAAEAGARG

The data I want to retrieve from the UniProt site is:

>sp|O43561|5-27
ILVPCVLGLLLLPILAMLMALCV
>sp|P08195|185-205
LLLLFWLGWLGMLAGAVVIIV
>sp|Q58CT8|38-58
IFSGFAALLAILLVVAVICVL

Each entry corresponds to the single transmembrane domain located in that protein.

The XML data that lists the TM annotation is:

<feature type="transmembrane region" description="Helical; Signal-anchor for type II membrane protein;" status="potential"><location><begin position="185"/><end position="205"/></location></feature>

I might be able to parse this and then use the position data to save only the sequence data needed. Does Biopython have this parser yet?
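For reference, here is a minimal sketch of that parse-and-slice idea using only Python's standard library. It assumes the feature XML looks like the fragment above; the helper names (`local_name`, `feature_spans`, `slice_region`) are made up for illustration, and the demo sequence is a padded stand-in rather than the real P08195 entry. Full UniProt XML files declare an XML namespace, so the sketch compares tag names with any namespace prefix stripped.

```python
import xml.etree.ElementTree as ET

# The feature fragment quoted above (P08195)
FEATURE_XML = (
    '<feature type="transmembrane region" '
    'description="Helical; Signal-anchor for type II membrane protein;" '
    'status="potential"><location><begin position="185"/>'
    '<end position="205"/></location></feature>'
)


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def feature_spans(xml_text, feature_type="transmembrane region"):
    """Return (begin, end) pairs, 1-based and inclusive, for matching features."""
    spans = []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "feature" or elem.get("type") != feature_type:
            continue
        begin = end = None
        for child in elem.iter():
            if local_name(child.tag) == "begin":
                begin = child.get("position")
            elif local_name(child.tag) == "end":
                end = child.get("position")
        # Skip features whose positions are missing or unknown
        if begin is not None and end is not None:
            spans.append((int(begin), int(end)))
    return spans


def slice_region(sequence, begin, end):
    """Convert 1-based inclusive coordinates into a Python slice."""
    return sequence[begin - 1:end]


# Stand-in sequence: 184 filler residues, the expected TM segment, more filler
demo_sequence = "A" * 184 + "LLLLFWLGWLGMLAGAVVIIV" + "R" * 10
for begin, end in feature_spans(FEATURE_XML):
    print(">sp|P08195|%i-%i" % (begin, end))
    print(slice_region(demo_sequence, begin, end))
```

The only subtle step is the coordinate conversion: UniProt positions are 1-based and inclusive, so residues 185 to 205 map to the Python slice `[184:205]`.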

uniprot biopython • 6.6k views

You can download the records as UniProt XML, or the old "SwissProt" plain text, and parse them locally to look for transmembrane domains & then extract the sequence for them. At least that's what I would try using Biopython.

Could you give a specific example (e.g. a UniProt protein ID where there are 3 transmembrane domains) and the desired output (e.g. a FASTA file with the region containing the three transmembrane domains only)?


I am primarily interested in single-pass transmembrane proteins.


I updated the question with some example data. Does Biopython have a parser for UniProt XML data?


Yes, "uniprot-xml" and "swiss" (plain text) are available in Biopython's Bio.SeqIO module, see http://biopython.org/wiki/SeqIO

11.1 years ago
Peter 6.0k

With the plain-text SwissProt format, something like this should work using Biopython:

# Hard-coded list, could use os.listdir(...) or glob?
filenames = ["O43561.txt", "P08195.txt", "Q58CT8.txt"]
input_format = "swiss"
feature_type = "TRANSMEM"
output_filename = "swiss_tm.fasta"

# Real code starts here...
from Bio import SeqIO

with open(output_filename, "w") as output:
    for filename in filenames:
        # Using SeqIO.parse will cope with multi-record files
        for record in SeqIO.parse(filename, input_format):
            for f in record.features:
                if f.type == feature_type:
                    # Biopython locations are 0-based, so add 1 to the start
                    # to report UniProt's 1-based, inclusive coordinates
                    title = "sp|%s|%i-%i" % (record.id, f.location.start + 1, f.location.end)
                    output.write(">%s\n%s\n" % (title, f.extract(record.seq)))

Or, using the UniProt XML format, change these lines:

filenames = ["O43561.xml", "P08195.xml", "Q58CT8.xml"]
input_format = "uniprot-xml"
feature_type = "transmembrane region"
output_filename = "uniprot_tm.fasta"

Either should give this as the FASTA format output:

>sp|O43561|5-27
ILVPCVLGLLLLPILAMLMALCV
>sp|P08195|185-205
LLLLFWLGWLGMLAGAVVIIV
>sp|Q58CT8|38-58
IFSGFAALLAILLVVAVICVL
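
Since the comments mention that only single-pass proteins are of interest, here is a rough sketch of a filter that keeps a record only when it has exactly one matching feature. The function name `single_pass_regions` and the `make_record` helper are hypothetical. The stand-in objects only mimic the attributes used here (`id`, `seq`, `features`, `location.start`, `location.end`) so that the sketch runs without Biopython; with real data you would pass in `SeqIO.parse(filename, input_format)` instead, and it assumes every location has exact integer positions.

```python
from types import SimpleNamespace


def single_pass_regions(records, feature_type="TRANSMEM"):
    """Yield (title, sequence) for records with exactly one matching feature."""
    for record in records:
        hits = [f for f in record.features if f.type == feature_type]
        if len(hits) != 1:
            continue  # skip multi-pass proteins and records with no match
        start = int(hits[0].location.start)  # 0-based, as in Biopython
        end = int(hits[0].location.end)      # end-exclusive
        title = "sp|%s|%i-%i" % (record.id, start + 1, end)
        yield title, str(record.seq)[start:end]


def make_record(accession, sequence, spans, feature_type="TRANSMEM"):
    """Build a stand-in record (assumption: mimics a SeqRecord just enough)."""
    features = [
        SimpleNamespace(type=feature_type,
                        location=SimpleNamespace(start=s, end=e))
        for s, e in spans
    ]
    return SimpleNamespace(id=accession, seq=sequence, features=features)


demo = [
    # First 32 residues of O43561, TM region at 0-based [4:27]
    make_record("O43561", "MEEAILVPCVLGLLLLPILAMLMALCVHCHRL", [(4, 27)]),
    # Hypothetical two-pass entry, which the filter should drop
    make_record("X00000", "A" * 60, [(5, 25), (30, 50)]),
]

for title, seq in single_pass_regions(demo):
    print(">%s\n%s" % (title, seq))
```

For the UniProt XML files, pass `feature_type="transmembrane region"` instead of the default.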

Note I have not talked about how to automatically download the SwissProt/UniProt files, which would be a separate question.
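
That said, a rough sketch of the download step might look like the following. It assumes UniProt's REST service serves single entries at `https://rest.uniprot.org/uniprotkb/<accession>.txt` (or `.xml`), which may change over time, so check the current UniProt documentation. The helper names `entry_url` and `download_entries` are made up for illustration, and nothing is downloaded until `download_entries` is called.

```python
import urllib.request

# Assumed base URL of UniProt's REST service; verify before relying on it
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession, extension="txt"):
    """Build the URL for one entry; 'txt' is SwissProt text, 'xml' is UniProt XML."""
    return "%s/%s.%s" % (BASE_URL, accession, extension)


def download_entries(accessions, extension="txt"):
    """Save each entry as <accession>.<extension> and return the filenames."""
    filenames = []
    for accession in accessions:
        filename = "%s.%s" % (accession, extension)
        urllib.request.urlretrieve(entry_url(accession, extension), filename)
        filenames.append(filename)
    return filenames


# Example (requires network access):
# filenames = download_entries(["O43561", "P08195", "Q58CT8"])
```

The returned list can be used directly as `filenames` in the script above.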

11.1 years ago

See my answer to "How to retrieve human proteins sequence containing a given domain":

java Biostar5862 -t "transmembrane region"
>[11011_ASFP4]|26-46
PFGCNMKGLGVLLGLFSLILA
>[11011_ASFP4]|154-174
LTLKQYCLYFIISIAFAGCFV
>[11011_ASFP4]|183-203
LNTTIKLLTLLSILVYLAQPV
>[141R_IIV6]|49-69
YIIYAIVAAILLLLFWLLYKK
>[14KD_RHOSH]|85-102
LGGFASGALLALALAGIF
>[1A29_HUMAN]|309-332
VGIIAGLVLFGAVFAGAVVAAVRW
>[1B01_PANTR]|306-329
GIVAGLAVLVVTVAVVAVVAAVMC
>[1B54_HUMAN]|309-332
VGIVAGLAVLAVVVIGAVVATVMC
>[1C18_HUMAN]|309-333
VGIVAGLAVLVVLAVLGAVVAVVMC
>[34KD_MYCPA]|42-62
IAVVALGFAAYLLNFGPTFTI

I don't have any experience with Java. I will give it a try, but I was hoping there was something I could use with Biopython.


Please ask this as a new question, not as an attempted answer to the transmembrane parsing question.
