Swiss-Prot Peptide Extraction
11.5 years ago
cdsouthan ★ 1.9k

I need to extract just the set of "peptide" feature annotations as a FASTA dump, from human Swiss-Prot in the first instance.

These are in the "molecule processing" sections, in the form: Peptide 34 – 42 9 Angiotensin 1-9 PRO_0000420659

The "fragments" also have the same PRO_ ID type, so I need to keep these out, as well as signal peptides.

Suggestions welcome; even better if someone could just drop them out (fewer than 500, I guess?).


Please give us one or two example accession numbers.


P01019 would be an example.

11.5 years ago
Neilfws 49k

This is quite easy to do using e.g. BioPerl or code from the other Bio* projects. See, for example, the feature annotation HOW-TO.

Here is another "quick and dirty" solution. Assuming that you have a file in SwissProt (UniProt) format, named P01019.txt, this ugly Perl code will extract the accession and the peptide features:

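#!/usr/bin/perl
use strict;
use warnings;

# Print "ACCESSION[start-end]" for every PEPTIDE feature in a UniProt flat file.
my $file = @ARGV ? $ARGV[0] : "P01019.txt";
my $ac = "";
open my $in, "<", $file or die "Cannot open $file: $!";
while (my $line = <$in>) {
  chomp $line;
  # Keep only the primary accession (first one on the entry's first AC line).
  if ($ac eq "" && $line =~ /^AC\s+(.*?);/) {
    $ac = $1;
  }
  # Accept "PEPTIDE 34 43" (older releases) and "PEPTIDE 34..43" (newer ones).
  if ($line =~ /^FT\s+PEPTIDE\s+(\d+)(?:\s+|\.\.)(\d+)/) {
    print $ac, "[$1-$2]\n";
  }
  # "//" terminates an entry, so reset the accession for multi-entry files.
  $ac = "" if $line =~ m{^//};
}
close $in;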

Redirect the output to a file; it will look like this:

P01019[34-43]
P01019[34-42]
P01019[34-41]
P01019[34-40]
P01019[34-38]
P01019[34-37]
P01019[35-41]
P01019[36-41]

Now you can upload that file at this URL and it will return the peptide sequences in Fasta format.
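
If you would rather skip the upload step, the flat file already contains the sequence, so the ranges can be sliced out locally. Here is a minimal Python sketch of that idea, assuming the usual flat-file layout (sequence lines follow the SQ line and each entry ends with //). The function name peptides_to_fasta is made up for illustration, and the regex accepts both the space-separated and the "34..43" coordinate styles.

```python
import re

# Matches "FT   PEPTIDE   34   43" as well as "FT   PEPTIDE   34..43".
PEPTIDE_RE = re.compile(r"^FT\s+PEPTIDE\s+(\d+)(?:\s+|\.\.)(\d+)")


def peptides_to_fasta(flat_text):
    """Return FASTA text for every PEPTIDE feature in UniProt flat-file text.

    Hypothetical helper: works on one entry or on many concatenated entries.
    """
    records = []
    accession, ranges, seq_parts, in_seq = None, [], [], False
    for line in flat_text.splitlines():
        if line.startswith("AC") and accession is None:
            # Primary accession only: first item on the first AC line.
            accession = line[2:].split(";")[0].strip()
        elif line.startswith("SQ"):
            in_seq = True
        elif line.startswith("//"):
            # End of entry: slice out each recorded range, then reset state.
            sequence = "".join(seq_parts)
            for start, end in ranges:
                # Feature coordinates are 1-based and inclusive.
                peptide = sequence[start - 1:end]
                records.append(f">{accession}[{start}-{end}]\n{peptide}")
            accession, ranges, seq_parts, in_seq = None, [], [], False
        elif in_seq:
            # Sequence lines are blocks of residues separated by spaces.
            seq_parts.append(line.replace(" ", ""))
        else:
            match = PEPTIDE_RE.match(line)
            if match:
                ranges.append((int(match.group(1)), int(match.group(2))))
    return "\n".join(records)
```

Calling peptides_to_fasta(open("P01019.txt").read()) and printing the result should give one FASTA record per peptide feature. Because only lines matching FT PEPTIDE are collected, signal peptides and chains are left out.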


Thanks, that's helpful, but even after what I could glean from the HOW-TO link, it's unclear how to select the UniProt entries in the first place, i.e. retrieve all the IDs conforming to annotated > human > feature key > molecule processing > (type) peptide > PRO....

After doing some more selects on the UniProt interface, this seems to be getting there (but no one needs to castigate me for partially answering my own question):

(organism:"Homo sapiens [9606]") AND reviewed:yes AND annotation:(type:peptide confidence:experimental) = 144

So I then need the peptide sequence ranges for all the IDs fulfilling the above. How and where could Neil's script pick up the 144 text files?


I assumed from your question - "from human Swiss-Prot" - that your starting data would be something like the human reference proteome set downloaded in SwissProt format. It's not very large, so there's no need to start with a subset from an advanced query.

11.5 years ago

The following XSLT stylesheet should extract the data from the UniProt XML file:


<xsl:stylesheet xmlns:u="http://uniprot.org/uniprot"
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0' 
    > 
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:apply-templates select="u:uniprot/u:entry[u:sequence]"/>
</xsl:template>

<xsl:template match="u:entry">
<xsl:apply-templates select="u:feature"/>
</xsl:template>

<xsl:template match="u:feature[u:location]">
<xsl:apply-templates select="u:feature"/>
</xsl:template>

<xsl:template match="u:feature[u:location/u:position]">
<xsl:apply-templates select="." mode="name"/>
<xsl:value-of select="substring(translate(../u:sequence,'
 ',''),number(u:location/u:position/@position),1)"/>
<xsl:text>
</xsl:text>
</xsl:template>

<xsl:template match="u:feature[u:location/u:begin]">
<xsl:variable name="start" select="number(u:location/u:begin/@position)"/>
<xsl:apply-templates select="." mode="name"/>
<xsl:value-of select="substring(translate(../u:sequence,'
 ',''),$start,1 + number(u:location/u:end/@position)- $start)"/>
<xsl:text>
</xsl:text>
</xsl:template>

<xsl:template match="u:feature" mode="name">
<xsl:text>></xsl:text>
<xsl:value-of select="../u:accession[1]"/>
<xsl:if test="@type">
<xsl:text>|type=</xsl:text>
<xsl:value-of select="@type"/>
</xsl:if>
<xsl:if test="@description">
<xsl:text>|description=</xsl:text>
<xsl:value-of select="@description"/>
</xsl:if>
<xsl:if test="@id">
<xsl:text>|id=</xsl:text>
<xsl:value-of select="@id"/>
</xsl:if>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>


xsltproc  features2fasta.xsl "http://www.uniprot.org/uniprot/P01019.xml" 

>P01019|type=signal peptide
MRKRAPQSEMAPAGVSLRATILCLLAWAGLAAG
>P01019|type=chain|description=Angiotensinogen|id=PRO_0000032456
DRVYIHPFHLVIHNESTCEQLAKANAGKPKDPTFIPAPIQAKTSPVDEKALQDQLVLVAAKLDTEDKLRAAMVGMLANFLGFRIYGMHSELWGVVHGATVLSPTAVFGTLASLYLGALDHTADRLQAILGVPWKDKNCTSRLDAHKVLSALQAVQGLLVAQGRADSQAQLLLSTVVGVFTAPGLHLKQPFVQGLALYTPVVLPRSLDFTELDVAAEKIDRFMQAVTGWKTGCSLMGASVDSTLAFNTYVHFQGKMKGFSLLAEPQEFWVDNSTSVSVPMLSGMGTFQHWSDIQDNFSVTQVPFTESACLLLIQPHYASDLDKVEGLTFQQNSLNWMKKLSPRTIHLTMPQLVLQGSYDLQDLLAQAELPAILHTELNLQKLSNDRIRVGEVLNSIFFELEADEREPTESTQQLNKPEVLEVTLNRPFLFAVYDQSATALHFLGRVANPLSTA
>P01019|type=peptide|description=Angiotensin-1|id=PRO_0000032457
DRVYIHPFHL
>P01019|type=peptide|description=Angiotensin 1-9|id=PRO_0000420659
DRVYIHPFH
>P01019|type=peptide|description=Angiotensin-2|id=PRO_0000032458
DRVYIHPF
>P01019|type=peptide|description=Angiotensin 1-7|id=PRO_0000420660
DRVYIHP
>P01019|type=peptide|description=Angiotensin 1-5|id=PRO_0000420661
DRVYI
>P01019|type=peptide|description=Angiotensin 1-4|id=PRO_0000420662
DRVY
>P01019|type=peptide|description=Angiotensin-3|id=PRO_0000032459
RVYIHPF
>P01019|type=peptide|description=Angiotensin-4|id=PRO_0000420663
VYIHPF

Thanks, but can we add NOT chain, NOT fragment and NOT signal?
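
One way to get that filter is to keep a feature only when its type attribute is exactly "peptide". That drops "signal peptide", "chain" and every other feature type in one test. Here is a minimal Python sketch of the idea using the standard library's ElementTree. It assumes the XML uses the same namespace and element names as the stylesheet above, and peptide_fasta is a hypothetical name, not an existing tool.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the stylesheet in this thread (assumption: adjust it
# if your XML download declares a different one).
NS = {"u": "http://uniprot.org/uniprot"}


def peptide_fasta(xml_text, wanted_type="peptide"):
    """Return FASTA text for features whose type equals wanted_type exactly."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("u:entry", NS):
        accession = entry.findtext("u:accession", default="", namespaces=NS)
        raw = entry.findtext("u:sequence", default="", namespaces=NS)
        sequence = "".join(raw.split())  # drop spaces and line breaks
        for feature in entry.findall("u:feature", NS):
            # Exact comparison, so "signal peptide" and "chain" are skipped.
            if feature.get("type") != wanted_type:
                continue
            begin = feature.find("u:location/u:begin", NS)
            end = feature.find("u:location/u:end", NS)
            if begin is None or end is None:
                continue  # single-position feature, not a range
            first, last = begin.get("position"), end.get("position")
            if first is None or last is None:
                continue  # boundary without a numeric position
            header = f">{accession}|type={wanted_type}"
            for key in ("description", "id"):
                if feature.get(key):
                    header += f"|{key}={feature.get(key)}"
            # Positions are 1-based and inclusive.
            records.append(f"{header}\n{sequence[int(first) - 1:int(last)]}")
    return "\n".join(records)
```

The headers follow the same accession|type|description|id pattern as the xsltproc output above, so the two results can be compared directly.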

11.1 years ago

Sorry for the late answer. If it is still of interest: to retrieve only the peptide sequences, start with this query (as you found out above) in GFF format: http://www.uniprot.org/uniprot/?query=%28annotation%3a%28type%3apeptide%29%29+AND+reviewed%3ayes+organism%3a9606&format=gff

Then follow the instructions in this FAQ, "How can I download the sequences corresponding to a specified domain or region from a list of UniProt entries?": http://www.uniprot.org/faq/50
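
GFF keeps the start and end coordinates in columns 4 and 5, so the download can be reduced to the same ACCESSION[start-end] list shown earlier in this thread. Here is a minimal Python sketch, assuming the file is tab-separated with the accession in column 1 and the feature type (e.g. "Peptide") in column 3. That layout is an assumption about the GFF export, so check it against your download. gff_to_ranges is a hypothetical helper name.

```python
def gff_to_ranges(gff_text, feature_type="peptide"):
    """Turn GFF rows of one feature type into 'ACCESSION[start-end]' strings."""
    ranges = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF header/comment lines
        columns = line.split("\t")
        if len(columns) < 5:
            continue  # not a feature row
        seqid, _source, kind, start, end = columns[:5]
        # Case-insensitive exact match, so "Signal peptide" rows are ignored.
        if kind.lower() == feature_type:
            ranges.append(f"{seqid}[{start}-{end}]")
    return ranges
```

Writing "\n".join(gff_to_ranges(text)) to a file gives one range per line for whichever retrieval step you use next.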
