Hi, I'm looking to extract protein IDs and sequences based on their size. More specifically, I would like to identify, in different genomes, all the proteins with a size in the range of 40-150 AA. Any ideas? Thanks
The corresponding query in the UniProt Knowledgebase is
length:[40 TO 150]
http://www.uniprot.org/uniprot/?query=length%3A[40+TO+150]&sort=score
You can use the Advanced Search to obtain this query (first select "Sequence", then "Length" and specify your ranges).
Ok,
Since you tagged this with database and query, I guess it is about downloading from a database. Here is an idea of how you could do that with Entrez Direct:
esearch -db protein -query "Staphylococcus aureus [ORGN]" | efilter -query "40:150 [SLEN]" | efetch -format fasta > aureus_protein_test
In this case Staphylococcus aureus is just an example. Put your organism of interest there and you are good to go. If you have a list of organisms, you could read the list in a loop and download the desired proteins for every organism in one go.
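Such a loop could look roughly like this. It is only a sketch: it assumes Entrez Direct is installed and on your PATH, that the organism names sit one per line in a text file (organisms.txt is a made-up name), and the function name and output file naming are my own choices.

```shell
# Sketch: download all 40-150 aa proteins for every organism listed in a file.
# Assumes Entrez Direct (esearch, efilter, efetch) is on the PATH.
# The function name, "organisms.txt" and the output names are hypothetical.
fetch_small_proteins() (
    # Read one organism name per line; also handle a last line without newline.
    while IFS= read -r organism || [ -n "$organism" ]; do
        # Skip blank lines.
        [ -z "$organism" ] && continue
        # e.g. "Staphylococcus aureus" -> Staphylococcus_aureus_40-150.fasta
        out="$(printf '%s' "$organism" | tr ' ' '_')_40-150.fasta"
        # esearch gets /dev/null as stdin so it cannot swallow the rest of the list.
        esearch -db protein -query "${organism} [ORGN]" < /dev/null |
            efilter -query "40:150 [SLEN]" |
            efetch -format fasta > "$out"
    done < "$1"
)

# Usage: fetch_small_proteins organisms.txt
```

Each organism ends up in its own FASTA file, so you can check the counts per genome before merging anything.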
Thank you. I'm surprised to see 316536 references. Would it be possible to eliminate duplicates and restrict the search to secreted proteins?
You could try this:
esearch -db protein -query "Staphylococcus aureus [ORGN] AND refseq[filter]" | efilter -query "40:150 [SLEN] AND secretion [ALL]" | efetch -format fasta > aureus_protein_test
It is a bit more stringent due to the refseq and secretion restrictions. I guess there is a better way to search every field for secretion, but I have no clue at the moment.
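On the duplicates: the same sequence can show up under several accessions, so one option is to collapse identical sequences locally after the download. A rough sketch with awk, assuming a FASTA file like the one written by the command above; the function name dedup_fasta is made up, it keeps the first header seen for each distinct sequence, and it re-checks the length range along the way.

```shell
# dedup_fasta FILE [MIN] [MAX]
# Hypothetical helper: prints each distinct sequence once (first header wins),
# keeping only sequences with MIN <= length <= MAX (defaults 40 and 150).
# Sequences are written on a single line, i.e. unwrapped.
dedup_fasta() {
    awk -v min="${2:-40}" -v max="${3:-150}" '
        # Print the record collected so far if it passes the filters.
        function emit(    n) {
            n = length(seq)
            if (header != "" && n >= min && n <= max && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Sequence line: strip whitespace and append.
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END { emit() }
    ' "$1"
}

# Usage: dedup_fasta aureus_protein_test > aureus_protein_unique.fasta
```

Note that this only removes exact duplicates; near-identical sequences from different strains would need a clustering tool instead.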
What do you mean by "identify"? Do you want to download them, for example from NCBI, for different organisms, or do you mean something else?
This question (in present form) is not logical. Practically, every known genome is likely to have protein(s) that fall in the range of 40-150 AA. You need to specify some additional criteria to narrow the selection.
You may also want to do this search using well-known, annotated proteins from UniProt, specifically Swiss-Prot.