Question

Swiss-Prot Peptide Extraction

12.2 years ago

cdsouthan ★ 1.9k

I need to extract just the set of "peptide" feature annotations as a FASTA dump, from human Swiss-Prot in the first instance.

These are in the "molecule processing" sections, in the form (feature key, positions, length, description, feature identifier):

Peptide | 34 – 42 | 9 | Angiotensin 1-9 | PRO_0000420659

The "fragments" also have the same PRO_ id type, so I need to keep these out, as well as signal peptides.

Suggestions welcome; even better if someone could just drop them out (fewer than 500, I guess?).

updated 11.8 years ago by Elisabeth Gasteiger ★ 2.4k • written 12.2 years ago by cdsouthan ★ 1.9k

Give us one or two example accession numbers, please.

12.2 years ago by Pierre Lindenbaum 166k

P01019 would be an example.

12.2 years ago by Neilfws 49k

score 1 · Answer 1 · 2013-06-11

12.2 years ago

Neilfws 49k

This is quite easy to do using e.g. Bioperl or code from the other Bio* projects. See for example the feature annotation HOW-TO.

Here is another "quick and dirty" solution. Assuming that you have a file in SwissProt (UniProt) format, named P01019.txt, this ugly Perl code will extract the accession and the peptide features:

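#!/usr/bin/perl
use strict;
use warnings;

my $file = "P01019.txt";  # entry in Swiss-Prot (UniProt) flat-file format
my $ac   = "";

open(my $in, "<", $file) or die "Cannot open $file: $!";
while (my $line = <$in>) {
  chomp $line;
  # AC line: keep the first (primary) accession listed.
  if ($line =~ /^AC\s+(.*?);/) {
    $ac = $1;
  }
  # PEPTIDE feature line: capture the start and end positions.
  if ($line =~ /^FT\s+PEPTIDE\s+(\d+)\s+(\d+)\s+/) {
    my $ft = "[$1-$2]";
    print "$ac$ft\n";
  }
}
close $in;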

Redirect the output to a file; it will look like this:

P01019[34-43]
P01019[34-42]
P01019[34-41]
P01019[34-40]
P01019[34-38]
P01019[34-37]
P01019[35-41]
P01019[36-41]

Now you can upload that file at this URL and it will return the peptide sequences in Fasta format.
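As a hedged alternative to the upload step, the peptide sequences can also be cut out locally, since the flat file carries the full sequence after the SQ line. The following is a minimal Python sketch, not the script above: `peptides_from_entry` and `to_fasta` are hypothetical helper names, and the regular expression assumes that PEPTIDE positions are plain integers, written either in the older column layout (`34     43`) or in the newer range layout (`34..43`). Features with uncertain boundaries are simply skipped.

```python
import re

# Accepts "FT   PEPTIDE      34     43 ..." as well as "FT   PEPTIDE   34..43".
# Assumption: positions are plain integers (no "<", ">" or "?" markers).
PEPTIDE_RE = re.compile(r"^FT\s+PEPTIDE\s+(\d+)(?:\.\.|\s+)(\d+)\b")


def peptides_from_entry(entry_text):
    """Return (accession, start, end, sequence) tuples for one flat-file entry."""
    accession = None
    ranges = []
    residues = []
    in_sequence = False
    for line in entry_text.splitlines():
        if line.startswith("AC") and accession is None:
            # First accession on the first AC line is the primary one.
            accession = line[2:].split(";")[0].strip()
        elif line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            in_sequence = False
        elif in_sequence:
            # Sequence lines are blocks of residues separated by spaces.
            residues.append(line.replace(" ", ""))
        else:
            match = PEPTIDE_RE.match(line)
            if match:
                ranges.append((int(match.group(1)), int(match.group(2))))
    sequence = "".join(residues)
    # Feature positions are 1-based and inclusive.
    return [(accession, s, e, sequence[s - 1:e]) for s, e in ranges]


def to_fasta(records):
    """Format the tuples as FASTA text with 'ACC[start-end]' headers."""
    return "".join(f">{acc}[{s}-{e}]\n{seq}\n" for acc, s, e, seq in records)
```

Reading `P01019.txt` into a string and printing `to_fasta(peptides_from_entry(text))` should give one FASTA record per PEPTIDE line; signal peptides and chains are ignored because only the PEPTIDE feature key is matched.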

12.2 years ago by Neilfws 49k

Thanks, that's helpful, but even with as much as I could glean from the HOW-TO link, it's unclear how to select the UniProt entries in the first place, i.e. retrieve all the IDs conforming to annotated > human > feature key > molecule processing > (type) peptide > PRO....

Doing some more selects on the UniProt interface, this seems to be getting there (but no one needs to castigate me for partially answering my own question):

(organism:"Homo sapiens [9606]") AND reviewed:yes AND annotation:(type:peptide confidence:experimental) = 144

Thus I then need the peptide sequence ranges for all the IDs fulfilling the above. So how and where could Neil's script pick up the 144 text files?

12.2 years ago by cdsouthan ★ 1.9k

I assumed from your question - "from human Swiss-Prot" - that your starting data would be something like the human reference proteome set downloaded in SwissProt format. It's not very large, there's no need to start with a subset from an advanced query.
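To illustrate working from one large download rather than 144 separate files, here is a rough Python sketch that walks a multi-entry flat file, keeps the human entries, and emits the same `ACC[start-end]` lines. It rests on some assumptions: entries end with a `//` line, the organism is identified by an `OX   NCBI_TaxID=9606` line, and PEPTIDE positions are plain integers. The function names are hypothetical.

```python
import re

# Assumption: positions are plain integers, in either the older column
# layout ("34     43") or the newer range layout ("34..43").
PEPTIDE_RE = re.compile(r"^FT\s+PEPTIDE\s+(\d+)(?:\.\.|\s+)(\d+)\b")
# Assumption: human entries carry an OX line naming NCBI taxon 9606.
HUMAN_RE = re.compile(r"^OX\s+NCBI_TaxID=9606\b")


def iter_entries(lines):
    """Yield each entry as a list of lines; a '//' line closes an entry."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)


def human_peptide_ranges(lines):
    """Yield 'ACC[start-end]' strings for PEPTIDE features of human entries."""
    for entry in iter_entries(lines):
        if not any(HUMAN_RE.match(line) for line in entry):
            continue
        # Primary accession: first item on the first AC line.
        accession = next(
            (line[2:].split(";")[0].strip() for line in entry if line.startswith("AC")),
            None,
        )
        for line in entry:
            match = PEPTIDE_RE.match(line)
            if match:
                yield f"{accession}[{match.group(1)}-{match.group(2)}]"
```

Passing an open file handle for the downloaded set (any iterable of lines works) to `human_peptide_ranges` and printing each result would produce the range list for the whole file in one pass.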

12.2 years ago by Neilfws 49k

score 1 · Answer 2 · 2013-06-12

The following XSL stylesheet should extract the data from the UniProt XML file:


<xsl:stylesheet xmlns:u="http://uniprot.org/uniprot"
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0' 
    > 
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:apply-templates select="u:uniprot/u:entry[u:sequence]"/>
</xsl:template>

<xsl:template match="u:entry">
<xsl:apply-templates select="u:feature"/>
</xsl:template>

<xsl:template match="u:feature[u:location]">
<xsl:apply-templates select="u:feature"/>
</xsl:template>

<xsl:template match="u:feature[u:location/u:position]">
<xsl:apply-templates select="." mode="name"/>
<xsl:value-of select="substring(translate(../u:sequence,'&#10; ',''),number(u:location/u:position/@position),1)"/>
<xsl:text>
</xsl:text>
</xsl:template>

<xsl:template match="u:feature[u:location/u:begin]">
<xsl:variable name="start" select="number(u:location/u:begin/@position)"/>
<xsl:apply-templates select="." mode="name"/>
<xsl:value-of select="substring(translate(../u:sequence,'&#10; ',''),$start,1 + number(u:location/u:end/@position)- $start)"/>
<xsl:text>
</xsl:text>
</xsl:template>

<xsl:template match="u:feature" mode="name">
<xsl:text>></xsl:text>
<xsl:value-of select="../u:accession[1]"/>
<xsl:if test="@type">
<xsl:text>|type=</xsl:text>
<xsl:value-of select="@type"/>
</xsl:if>
<xsl:if test="@description">
<xsl:text>|description=</xsl:text>
<xsl:value-of select="@description"/>
</xsl:if>
<xsl:if test="@id">
<xsl:text>|id=</xsl:text>
<xsl:value-of select="@id"/>
</xsl:if>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>


xsltproc  features2fasta.xsl "http://www.uniprot.org/uniprot/P01019.xml" 

>P01019|type=signal peptide
MRKRAPQSEMAPAGVSLRATILCLLAWAGLAAG
>P01019|type=chain|description=Angiotensinogen|id=PRO_0000032456
DRVYIHPFHLVIHNESTCEQLAKANAGKPKDPTFIPAPIQAKTSPVDEKALQDQLVLVAAKLDTEDKLRAAMVGMLANFLGFRIYGMHSELWGVVHGATVLSPTAVFGTLASLYLGALDHTADRLQAILGVPWKDKNCTSRLDAHKVLSALQAVQGLLVAQGRADSQAQLLLSTVVGVFTAPGLHLKQPFVQGLALYTPVVLPRSLDFTELDVAAEKIDRFMQAVTGWKTGCSLMGASVDSTLAFNTYVHFQGKMKGFSLLAEPQEFWVDNSTSVSVPMLSGMGTFQHWSDIQDNFSVTQVPFTESACLLLIQPHYASDLDKVEGLTFQQNSLNWMKKLSPRTIHLTMPQLVLQGSYDLQDLLAQAELPAILHTELNLQKLSNDRIRVGEVLNSIFFELEADEREPTESTQQLNKPEVLEVTLNRPFLFAVYDQSATALHFLGRVANPLSTA
>P01019|type=peptide|description=Angiotensin-1|id=PRO_0000032457
DRVYIHPFHL
>P01019|type=peptide|description=Angiotensin 1-9|id=PRO_0000420659
DRVYIHPFH
>P01019|type=peptide|description=Angiotensin-2|id=PRO_0000032458
DRVYIHPF
>P01019|type=peptide|description=Angiotensin 1-7|id=PRO_0000420660
DRVYIHP
>P01019|type=peptide|description=Angiotensin 1-5|id=PRO_0000420661
DRVYI
>P01019|type=peptide|description=Angiotensin 1-4|id=PRO_0000420662
DRVY
>P01019|type=peptide|description=Angiotensin-3|id=PRO_0000032459
RVYIHPF
>P01019|type=peptide|description=Angiotensin-4|id=PRO_0000420663
VYIHPF
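The output above includes every feature with a location (signal peptide and chain as well), whereas the question asks for the peptide features only. As a hedged illustration of that filter, here is a small Python sketch using the standard-library `xml.etree.ElementTree` instead of XSLT. It assumes the same namespace as the stylesheet (`http://uniprot.org/uniprot`, overridable through the `ns` parameter) and the same element and attribute names the stylesheet relies on; `peptide_fasta` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumption: same namespace URI as declared in the stylesheet.
UNIPROT_NS = "http://uniprot.org/uniprot"


def peptide_fasta(xml_text, ns=UNIPROT_NS):
    """Return FASTA text for features whose type attribute is 'peptide'."""

    def q(tag):
        # Build a namespace-qualified tag name for ElementTree.
        return f"{{{ns}}}{tag}"

    out = []
    for entry in ET.fromstring(xml_text).iter(q("entry")):
        accession = entry.findtext(q("accession"))
        # Strip any whitespace or line breaks inside the sequence text.
        sequence = "".join((entry.findtext(q("sequence")) or "").split())
        for feature in entry.findall(q("feature")):
            if feature.get("type") != "peptide":
                continue  # skips signal peptides, chains and everything else
            location = feature.find(q("location"))
            begin = location.find(q("begin")) if location is not None else None
            end = location.find(q("end")) if location is not None else None
            if begin is None or end is None:
                continue
            start_pos, end_pos = begin.get("position"), end.get("position")
            if start_pos is None or end_pos is None:
                continue  # boundary not given as a number
            header = f">{accession}|type=peptide"
            for attr in ("description", "id"):
                if feature.get(attr):
                    header += f"|{attr}={feature.get(attr)}"
            # Positions are 1-based and inclusive.
            out.append(f"{header}\n{sequence[int(start_pos) - 1:int(end_pos)]}\n")
    return "".join(out)
```

The headers follow the same `accession|type=...|description=...|id=...` pattern as the stylesheet output, so the two results can be compared directly.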

score 0 · Answer 3 · 2013-10-29

Sorry for the late answer. If it is still of interest: to retrieve only the peptide sequences, start with this query (as you found out above) in GFF format: http://www.uniprot.org/uniprot/?query=%28annotation%3a%28type%3apeptide%29%29+AND+reviewed%3ayes+organism%3a9606&format=gff

Then follow the instructions in this FAQ: http://www.uniprot.org/faq/50 How can I download the sequences corresponding to a specified domain or region from a list of UniProt entries?
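For the step between the GFF download and the sequence retrieval, a short Python sketch may help. It assumes the downloaded file is standard tab-separated GFF with the accession in column 1, the feature type in column 3 and the start and end positions in columns 4 and 5, and that the retrieval step accepts the `ACCESSION[start-end]` range format shown in the first answer. The type string `Peptide` is an assumption about how the feature key is spelled in the download, so it is left as a parameter; `gff_to_ranges` is a hypothetical name.

```python
def gff_to_ranges(gff_lines, feature_type="Peptide"):
    """Turn GFF feature rows into 'ACC[start-end]' range strings."""
    ranges = []
    for line in gff_lines:
        # Skip blank lines and GFF header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Columns: seqid, source, type, start, end, score, strand, phase, attributes.
        if len(cols) < 5 or cols[2] != feature_type:
            continue
        ranges.append(f"{cols[0]}[{cols[3]}-{cols[4]}]")
    return ranges
```

Writing the returned strings one per line gives a list in the same shape as the Perl output earlier in the thread.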