Identify all proteins smaller than 150 AA
7.9 years ago

Hi, I'm looking to extract protein IDs and sequences based on their size. More specifically, I would like to identify, in different genomes, all the proteins with a length in the range of 40-150 AA. Any ideas? Thanks

protein sequence database query • 1.9k views

What do you mean by "identify"? Do you want to download them, for example from NCBI, for different organisms, or do you mean something else?


"More specifically, I would like to identify, in different genomes, all the proteins with a length in the range of 40-150 AA."

This question (in its present form) is not logical. Practically every known genome is likely to have proteins that fall in the range of 40-150 AA. You need to specify some additional criteria to narrow the selection.

You may also want to do this search using well-known, annotated proteins from UniProt, specifically Swiss-Prot.

7.9 years ago

The corresponding query in the UniProt Knowledgebase is

length:[40 TO 150]

http://www.uniprot.org/uniprot/?query=length%3A[40+TO+150]&sort=score

You can use the Advanced Search to obtain this query (first select "Sequence", then "Length" and specify your ranges).
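For scripted downloads, the same length query can also be sent to UniProt programmatically. The following is only a sketch, not part of the answer above: it assumes the current REST endpoint is `https://rest.uniprot.org/uniprotkb/stream` (the URL above uses the older site layout), that it accepts `query` and `format` parameters, and that `curl` is installed. The function names `uniprot_length_query` and `fetch_uniprot_by_length` are made up for illustration.

```shell
# Build the UniProtKB length-range query string, e.g. "length:[40 TO 150]".
uniprot_length_query() {
    printf 'length:[%s TO %s]' "$1" "$2"
}

# Download matching entries as FASTA into the given output file.
# Arguments: min length, max length, output file, optional extra query terms.
# Assumption: the endpoint URL and its "query"/"format" parameters are as
# described in the text above this block; check the UniProt API help pages.
fetch_uniprot_by_length() {
    local min="$1" max="$2" out="$3" extra="${4:-}"
    local query
    query="$(uniprot_length_query "$min" "$max")"
    if [ -n "$extra" ]; then
        query="$query AND $extra"
    fi
    # -G sends the URL-encoded fields as a GET query string.
    curl -sS -G 'https://rest.uniprot.org/uniprotkb/stream' \
        --data-urlencode "query=$query" \
        --data-urlencode 'format=fasta' \
        -o "$out"
}
```

A call might look like `fetch_uniprot_by_length 40 150 small_proteins.fasta 'reviewed:true'`, where `reviewed:true` is assumed to restrict the results to Swiss-Prot entries. Without some extra restriction, the unfiltered length query matches a very large number of entries.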

7.9 years ago
j_susat ▴ 40

OK,

I guess that, since you tagged this with database and query, it is about downloading from a database. So here is an idea of how you could do that with Entrez Direct:

esearch -db protein -query "Staphylococcus aureus [ORGN]" | efilter -query "40:150 [SLEN]" | efetch -format fasta > aureus_protein_test

In this case Staph aureus is just an example. Just put your organism of interest there and you are good to go. If you have a list of different organisms, you could read the list in a loop and download the desired proteins for every organism with one command.
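The loop mentioned above could be sketched roughly as follows. It reuses the same `esearch | efilter | efetch` pipeline. The function names, the output file naming, and the one-organism-per-line list file are assumptions made for illustration.

```shell
# Download proteins within a length range for one organism,
# using the same Entrez Direct pipeline as the command above.
fetch_small_proteins() {
    local organism="$1" min="${2:-40}" max="${3:-150}"
    local out
    # Output file name: organism name with spaces replaced by underscores.
    out="$(printf '%s' "$organism" | tr ' ' '_')_${min}_${max}.fasta"
    esearch -db protein -query "$organism [ORGN]" |
        efilter -query "${min}:${max} [SLEN]" |
        efetch -format fasta > "$out"
}

# Run the download once for every non-empty line of a list file.
fetch_for_list() {
    local organism
    while IFS= read -r organism; do
        [ -n "$organism" ] || continue
        # Redirect stdin so the pipeline cannot consume the rest of the list.
        fetch_small_proteins "$organism" < /dev/null
    done < "$1"
}
```

With a file `organisms.txt` containing one organism name per line, `fetch_for_list organisms.txt` would write one FASTA file per organism, for example `Staphylococcus_aureus_40_150.fasta`.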

Here is some information about Entrez Direct

7.9 years ago

Thank you. I'm surprised to see 316536 references. Would it be possible to eliminate duplicates and restrict the search to secreted proteins?


Please use ADD COMMENT to respond to earlier replies; that way this thread remains logically structured and easy to follow.


You could try this:

esearch -db protein -query "Staphylococcus aureus [ORGN] AND refseq[filter]" | efilter -query "40:150 [SLEN] AND secretion [ALL]" | efetch -format fasta  > aureus_protein_test

It is a bit more stringent due to the RefSeq and secretion restrictions. I guess there is a better way to search every field for secretion, but I have no clue at the moment.
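If identical sequences still show up under different accessions after the RefSeq filter, they could also be removed locally once the FASTA file is downloaded. This is a separate post-processing step, not an Entrez Direct feature, and only a minimal sketch: the function name `dedup_fasta` is made up, it assumes a standard `awk` is available, and it treats two records as duplicates only if their sequences are exactly identical (ignoring case).

```shell
# Keep only the first record for each distinct sequence in a FASTA file.
# Sequences are written on a single line in the output.
dedup_fasta() {
    awk '
        function emit() {
            # Print the buffered record if its sequence has not been seen yet.
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
        END { emit() }
    ' "$1"
}
```

For example, `dedup_fasta aureus_protein_test > aureus_protein_unique.fasta` would keep one entry per unique sequence (the header of the first occurrence is the one retained).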
