Download protein sequences from UniProt that do not contain ambiguous amino acids
10.2 years ago
arronslacey ▴ 320

Hi, I need to download random protein sequences from UniProt. I can do this quite easily using something like reviewed:yes&random=yes

However, I do not want to retrieve any sequences that include ambiguous amino acids such as 'B' or 'X'. Is there a search parameter that can specify this? Or will I need to check each sequence and throw it away if it contains them?

Thanks.

uniprot protein • 3.3k views
9.0 years ago
me ▴ 760

This can be done at the UniProt SPARQL endpoint using this query:

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?protein ?sequence
WHERE {
    ?protein up:reviewed true .
    ?protein up:sequence ?sequenceConcept .
    ?sequenceConcept rdf:value ?sequence .
    FILTER ( ! contains(?sequence, 'X')) .
    FILTER ( ! contains(?sequence, 'B')) .
    FILTER ( ! contains(?sequence, 'Z')) .
}

You can get a nearly FASTA-formatted file from the CSV download using this query (just strip the quotes out):

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?fasta
WHERE {
    ?protein up:reviewed true .
    ?protein up:sequence ?sequenceConcept .
    ?sequenceConcept rdf:value ?sequence .
    FILTER ( ! contains(?sequence, 'X')) .
    FILTER ( ! contains(?sequence, 'B')) .
    FILTER ( ! contains(?sequence, 'Z')) .
    BIND(concat('>',SUBSTR(STR(?protein),33),'\n',?sequence) AS ?fasta)
}
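For the "strip the quotes out" step, here is a minimal Python sketch. It assumes the CSV download has a single header row (the variable name, fasta) and that each record is a quoted field containing an embedded newline between the header line and the sequence; the function name csv_to_fasta is hypothetical and not part of any UniProt tooling.

```python
import csv
import io


def csv_to_fasta(csv_text):
    """Turn a one-column SPARQL CSV result into FASTA text.

    Assumption: the first row is the column header and every other row
    holds one '>accession\\nSEQUENCE' value, quoted by the CSV writer.
    """
    # newline="" lets the csv module handle newlines inside quoted fields.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    next(reader, None)  # skip the header row
    records = [row[0] for row in reader if row]
    return "".join(record + "\n" for record in records)


if __name__ == "__main__":
    sample = 'fasta\r\n">P12345\nMKT"\r\n">Q67890\nACD"\r\n'
    print(csv_to_fasta(sample), end="")
```

Using the csv module rather than a plain text replace keeps the sketch safe if a value ever contains an escaped quote or a comma.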
10.2 years ago

On the UniProt website there is no way to use text search criteria to query for (or exclude) entries whose sequences contain a certain type of amino acid.

UniProtKB/Swiss-Prot release 2014_09 (due out tomorrow) will contain 2360 sequences with 'B', 'X' or 'Z'. You could either find these 2360 entries first (in a local copy of UniProtKB/Swiss-Prot) and then discard them if your website query happens to retrieve one of them, or, as you suggest, inspect the sequences of your retrieved entries and discard them if they contain the unwanted amino acids.
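The second option, inspecting retrieved sequences and discarding those with unwanted residues, could be sketched in Python roughly as follows. This assumes the entries were downloaded as FASTA text; the helper names (parse_fasta, is_unambiguous, filter_fasta) are hypothetical, and the residue set 'B', 'X', 'Z' is taken from the codes mentioned above.

```python
AMBIGUOUS = set("BXZ")  # ambiguous amino acid codes to exclude


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_unambiguous(sequence, ambiguous=AMBIGUOUS):
    """Return True if the sequence contains none of the ambiguous codes."""
    return ambiguous.isdisjoint(sequence.upper())


def filter_fasta(text):
    """Keep only the records whose sequence has no ambiguous residues."""
    return [(h, s) for h, s in parse_fasta(text) if is_unambiguous(s)]


if __name__ == "__main__":
    example = ">P1\nMKT\nAAA\n>P2\nMKXA\n>P3\nmkbz\n>P4\nACDE\n"
    for header, seq in filter_fasta(example):
        print(f">{header}\n{seq}")
```

If random sampling is needed, one approach is to keep requesting entries until enough have passed the filter, since only a small fraction of reviewed entries is affected.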

9.0 years ago

Using my tool https://github.com/lindenb/jvarkit/wiki/UniprotFilterJS:

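$ cat filter.js
// Return true for entries whose sequence contains neither 'X' nor 'B'.
function accept(e)
    {
    // Upper-case the sequence so the check is case-insensitive.
    var s = e.sequence.value.toUpperCase();
    return (s.indexOf("X")==-1 && s.indexOf("B")==-1);
    }
// The value of this last expression decides whether the entry is kept.
accept(entry);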

and

curl -skL "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz" | gunzip -c |java -jar dist-1.139/uniprotfilterjs.jar  -f filter.js  > filtered.xml