Is it possible to download a random set of proteins? (fasta files)
6
2
10.4 years ago
arronslacey ▴ 320

Hi - I was wondering if it is possible to download a random sample of proteins from a given protein database. I want to do this to compare proteins of interest to "background proteins", i.e. a control. Probably a little trickier would be to download proteins that aren't of a certain type, e.g. non-membrane proteins.

Has anyone done anything like this? I see in papers all the time "we used non-XXX proteins as a negative training set", and I'd imagine something like this would be a pain to do manually.

Ideally I would not like to download entire databases, but rather do this task online.

Anyone done this sort of thing?

protein pdb swissprot uniprot pfam • 5.9k views
ADD COMMENT
2

What do you mean by downloading a protein?

AFAIK it is hard to transport amino acids over http.

ADD REPLY
0

sorry my bad - fasta files.

ADD REPLY
0

Hi Pierre thanks for this, trying out now but getting "ERROR 2003 (HY000): Can't connect to MySQL server on 'genome-mysql.cse.ucsc.edu' (113)"

Probably a firewall issue with my campus so I'll let you know how I get on.

ADD REPLY
0

You should edit your question and add something like "Edit 1" at the end of it with your progress, rather than posting it as an answer. Or post a new question if that really doesn't work.

ADD REPLY
0

this should be a comment, not an answer. And, yes, it is a problem with the firewall

ADD REPLY
5
10.4 years ago
Hugues ▴ 250

Here are all the ~20k reviewed Homo sapiens proteins in the UniProt database:

http://www.uniprot.org/uniprot/?query=%28taxonomy%3A9606%29+AND+reviewed%3Ayes

Play around, exclude some, include non-reviewed or change organism, then click download.

Choose whichever format suits you, e.g. FASTA.

Here is the first one (just for fun):

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

Then load some at random.
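As a rough illustration of "load some at random", here is a minimal Python sketch that parses FASTA text and picks a few records uniformly at random. The function names `read_fasta` and `pick_random` and the three toy records are made up for this example; in practice the lines would come from the file downloaded from UniProt.

```python
import random

def read_fasta(lines):
    """Parse an iterable of FASTA text lines into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def pick_random(records, k, seed=None):
    """Return up to k records chosen uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

# Toy input (invented records, not real UniProt entries).
example_lines = """>toy|X1 first
MTMDK
SELVQ
>toy|X2 second
AGEGEN
>toy|X3 third
KAKLA
""".splitlines()

records = read_fasta(example_lines)
chosen = pick_random(records, 2, seed=0)
for header, sequence in chosen:
    print(">" + header)
    print(sequence)
```

For a real download, `read_fasta(open("downloaded.fasta"))` would work the same way, where `downloaded.fasta` is whatever name you saved the file under.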

ADD COMMENT
1

Thanks Hugues - I think I will end up combining your query with Pierre's implementation using a MySQL interface. Alex's suggestion about using "sample" is interesting too.

ADD REPLY
1

I also like that I am able to add family NOT XXX.

This is really helpful when I want to pick out proteins that aren't in a certain family.

ADD REPLY
3
10.4 years ago

If you pull one sample, it may not accurately reflect your background. By random chance, that one sample may not distinguish your proteins-of-interest from true background. Maybe you'll get lucky.

If you have a file containing your "universe" of proteins (e.g., all proteins except for membrane proteins, or whatever), and the FASTA headers and sequences are on alternate lines (or can be preprocessed to have that structure), then you can use a command-line program like sample to quickly extract a body of samples that more accurately define your background - say, 100 samples of 50 proteins, uniformly sampled at random without replacement:

$ for padded_idx in $(seq -f "%03g" 0 99); do \
    sample --lines-per-offset=2 --sample-size=50 allProteins.fasta \
    > sample_${padded_idx}.fasta; \
done

Then you can analyze all of sample_*.fasta for their expected characteristics.
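To make the point about repeated sampling concrete, here is a small Python sketch (not the `sample` tool itself) that draws many samples without replacement and summarizes one characteristic, mean sequence length, across them. The helper names and the toy "universe" of sequences are assumptions made for illustration only.

```python
import random
import statistics

def repeated_samples(items, n_samples=100, sample_size=50, seed=None):
    """Draw n_samples independent samples, each without replacement."""
    rng = random.Random(seed)
    return [rng.sample(items, sample_size) for _ in range(n_samples)]

def mean_length_per_sample(samples):
    """Mean sequence length within each sample."""
    return [statistics.mean(len(seq) for seq in sample) for sample in samples]

# Toy universe: 1000 invented sequences with lengths between 50 and 400.
toy_rng = random.Random(1)
universe = ["A" * toy_rng.randint(50, 400) for _ in range(1000)]

samples = repeated_samples(universe, n_samples=100, sample_size=50, seed=2)
means = mean_length_per_sample(samples)

# The spread of per-sample means shows how much a single sample could mislead.
print("min:", min(means))
print("mean of means:", statistics.mean(means))
print("max:", max(means))
```

Any other per-protein measurement could replace `len(seq)`; the point is to look at the distribution over samples instead of a single draw.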

ADD COMMENT
0

Thanks alex - I'll definitely be using this once I have my proteins.

ADD REPLY
3
10.4 years ago

Using the UCSC MySQL server for UniProt (slow!):

 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -D uniProt -e 'select concat(">",F.acc),SEQ.val,count(C.id) as TOTAL,rand()   from (feature as F,protein as SEQ) left join featureClass as C on C.id=F.featureClass and C.val="TRANSMEM" where SEQ.acc=F.acc group by F.acc having TOTAL=0 order by 4  limit 10' |cut -f 1,2 | tr "\t" "\n" | fold -w 60

>A1INK5
RYADYPEVHTTWNVISTIGSTISFLGILYFFYIIWESLITQRMVLFSIQLCSSIEWLQNS
PPAEHSYSELPLLTNF
>K1SNC9
PYAVTKADFISERKHNSLPVMPNAYGHTEYDENGNTYLCFDKGESDDFVHSYAVFYSDGT
RYDYFSDFYKGISSMADEVKLPVYSKSPGVYNIKVYAIDSYGSISDSYTSIDRSEVRRRK
TYRRKLAPEIKY
>Q2YN00
MNNGTSPAGGETEATQTRSGFVALIGAPNAGKSTLVNQLVGTKVSIVTHKVQTTRALVRG
IFIEGPAQIVLVDTPGIFRPKRRLDRAMVTTAWGGAKDADIILVIIDAQGGFNENAEALL
ESMKDIRQKKVLVLNKVDRVDPPVLLSLAQKANELVPFDRTFMISALNGSGCKDLAKYLA
ESVPNGPWYYPEDQISDMPMRQLAAEITREKLYLRLHEELPYASTVETERWEERKDGSVR
IEQVIYVERESQKKIVLGHKGETVKAIGQAARKEISEILEQTVHLFLFVKVRENWGNDPE
RYREMGLDFPT
>J2I554
MSESDTVLSAFAGGVLTLALNRPDKLNAFNEEMHLALRAGFERAHQDASVRAVLLTGAGR
AFCAGQDLGDRDPRNGGAAPDLGRTIELFYNPLIRLIRTLEKPVICAVNGVAAGAGANIA
LACDITLAARSARFIQAFAKIGLVPDSGGTWSLPRLLGEARAKALALTAEPLDAETAASW
GLIWKAVDDAELLDEANTLATRLAAGPTKGLGMTKRAIQAAATNSLDEQLELERDLQREA
GRSADYAEGVLAFLEKRKPEFKGQ
>F1BDQ6
MEEIQRYLQLERSEQHDFLYPLIFQEYIYAFAHDRGFNRSILSENPGYDNKSSLLIVKRL
ISRMYQQNHFLISPNDWNQNPFWVRNRNFYSQIISEGFAFIVEIPFSRRLISCLEEKDSQ
ISEFTINSFNISLFRGQFFTSKFSIRYTNTPPCPWGNLGSNSSLXXXNEYCNCNSLITPT
KASSSFLKRNQRLFLFLYNSHVSEYESIFVFLRNQSSHLRSTSSGVLLERIYFYRKIKRL
VNVFLKVKDFQANLCLGNEPCMHSIRYQRKSSLASKGTSLSMNKWKCYLVTFWQWHFSLW
FHPRRIYINQLSNHSLDFLGYHSSVRMNSSMVRSQILENSFRINNAIKKFDTLLPIIPMI
SSLAKAKFCNVLGHPISKPVRADLSDSN
>G3MEH5
EDGINQVQSSVAEYPEAITYLLEQYDKYEAEQLRLSDIISGFIDPNETDDVAPTATHIGS
ELSEEDLADEDEDEDEDEDGDGDDSDDDGDGGPDPEVAREKFGELRAQYEVTRLSIQQNG
RAHEDTQNAIAQLADVFRQFRLMPKQFDRLVNNMREMMERVRVQERIIMKLCVEQAKMPK
KTFVAAFTNNECETAWFEYQKQAGKAWSPRLVEMDEDVLRAIGKLQQIEEET
>T1UMN5
MFPILSQFLNSGQQTIRAARYIGQGFMITLSHANRLPVTIQYPYEKLITSDRFRGRIHFE
FDKCIACEVCVRVCPIDLPVVDWKLEINIRKKRLLNYSIDFGICIFCGNCVEYCPTNCLS
MTEEYELSTYDRHELNYNQIALGRLPVSIIDDYTIRTISSNSPQIKNV
>S3GGS4
MKHVLSIQSHVVYGYAGNKSATFPMQLLGVDVWALNTVQFSNHTQYGKWTGMVIPKEQIG
EIVRGIDAIEALHLCDAIVSGYIGSAEQVEEIVNAVRFIKSKNPNALYLCDPVMGHPDKG
CIVAEGVKEGLINLAMAEADLITPNLVELRELSGLPVENFAQAQDAVRAILAKGPKKVLV
KHLSKVGKDSSQFEMLLATKDGMWHISRPLHQFRKEPVGVGDLTAGLFIANLLNGKSDIE
AFEHTANAVNDVMTVTQQKDNYELQIIAAREYIMQPSSQYKAVKIA
>I2I5C1
MARIIVVTSGKGGVGKTTSSAAIATGLAQKGKKTVVIDFDIGLRNLDLIMGCERRVVYDF
VNVIQGDATLNQALIKDKRTENLYILPASQTRDKDALTREGVAKVLDDLKAMDFEFIVCD
SPAGIETGALMALYFADEAIITTNPEVSSVRDSDRILGILASKSRRAENGEEPIKEHLLL
TRYNPGRVSRGDMLSMEDVLEILRIKLVGVIPEDQSVLRASNQGEPVILDINADAGKAYA
DTVERLLGEERPFRFIEEEKKGFLKRLFGG
>V1T7L7
KYYMDDITQENVMSFLTPVYLAGTLKGIVMVDVNQDNLKIFLYPGPSAGLALS
ADD COMMENT
3
10.4 years ago

On the UniProt web site, http://www.uniprot.org, you can add &random=yes to any query (see here).

The following query returns a random reviewed human entry:

http://www.uniprot.org/uniprot/?query=reviewed:yes+AND+organism:9606&random=yes

However, as far as using UniProtKB/Swiss-Prot to build negative data sets is concerned, please read this FAQ

ADD COMMENT
0
10.4 years ago
Hugues ▴ 250

Here is some python code:

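# Requires Biopython; fetches one Swiss-Prot entry over the network.
from Bio import ExPASy
from Bio import SwissProt

handle = ExPASy.get_sprot_raw('B5ZC00')  # raw Swiss-Prot text for this accession
record = SwissProt.read(handle)          # parse the single record
print(record.sequence)                   # print() call works in Python 3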

To take this further, write a file with the IDs of your proteins and feed them (or some of them) to your query. Write the output to a FASTA file (with a header line) instead of printing it to the terminal.
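A possible shape for that workflow, sketched with the standard library only: read IDs one per line, look up each sequence, and write wrapped FASTA records. The names `read_ids`, `to_fasta` and `export_fasta` are invented here, and `fetch_sequence` is a hypothetical callable; in the demo it is a plain dictionary lookup over made-up sequences, where a real script would wrap the Biopython calls instead.

```python
import io

def read_ids(handle):
    """Return the non-empty, stripped lines of an ID file (one ID per line)."""
    return [line.strip() for line in handle if line.strip()]

def to_fasta(identifier, sequence, width=60):
    """Format one record as FASTA text, wrapping the sequence at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"

def export_fasta(ids, fetch_sequence, out_handle):
    """Write one FASTA record per ID, using fetch_sequence(id) to get the sequence."""
    for identifier in ids:
        out_handle.write(to_fasta(identifier, fetch_sequence(identifier)))

# Demo with in-memory "files" and made-up sequences (assumptions, not real data).
id_file = io.StringIO("B5ZC00\n\nX00001\n")
toy_sequences = {"B5ZC00": "M" * 70, "X00001": "ACDEFG"}

out = io.StringIO()
export_fasta(read_ids(id_file), toy_sequences.__getitem__, out)
fasta_text = out.getvalue()
print(fasta_text, end="")
```

Swapping `io.StringIO` for `open("ids.txt")` and `open("out.fasta", "w")` (file names of your choosing) turns the demo into a file-based script.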

ADD COMMENT
0
10.4 years ago
5heikki 11k

Generate two lists of random numbers with e.g. $RANDOM, multiply them, treat the results as GI numbers, and pull them from nr with blastdbcmd.

ADD COMMENT
0

Are all possible GI numbers attributed?

ADD REPLY
0

How to exclude membrane proteins? No, I think that a list is the way to go.

ADD REPLY
0

It could be, though, that choosing randomly and then filtering out membrane proteins might work. That would be difficult, though, as you would need a list of all known membrane proteins.
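If such an exclusion list were available, the filter-then-sample step itself is short. A minimal Python sketch, assuming you already have a list of candidate IDs and a list of IDs to exclude; the function name `sample_non_members` and all IDs below are invented placeholders, not real accessions.

```python
import random

def sample_non_members(candidates, excluded, k, seed=None):
    """Sample k IDs uniformly at random from candidates not in excluded."""
    excluded = set(excluded)
    pool = [c for c in candidates if c not in excluded]
    if k > len(pool):
        raise ValueError("not enough candidates left after filtering")
    return random.Random(seed).sample(pool, k)

# Placeholder IDs: ten candidates, three of which are on the exclusion list.
candidates = ["ID%03d" % i for i in range(10)]
membrane_ids = ["ID001", "ID004", "ID007"]

picked = sample_non_members(candidates, membrane_ids, k=3, seed=0)
print(picked)
```

Filtering before sampling keeps the sample size fixed; sampling first and filtering afterwards would return fewer proteins whenever an excluded ID happened to be drawn.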

ADD REPLY
