Question

Is it possible to download a random set of proteins? (fasta files)

2

10.4 years ago

arronslacey ▴ 320

Hi - I was wondering if it is possible to download a random sample of proteins from a given protein database. I want to do this to compare proteins of interest to "background proteins", i.e. a control. Probably a little trickier would be to download proteins that aren't of a certain type, e.g. non-membrane proteins.

Has anyone done anything like this? I see in papers all the time "we used non-XXX proteins as a negative training set." And I'd imagine something like this would be a pain to do manually.

Ideally I would not like to download entire databases, but rather do this task online.

Anyone done this sort of thing?

protein pdb swissprot uniprot pfam • 5.9k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by arronslacey ▴ 320

2

What do you mean by downloading a protein?

AFAIK it is hard to transport amino acids over http.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Hugues ▴ 250

0

Sorry, my bad - FASTA files.

ADD REPLY • link 10.4 years ago by arronslacey ▴ 320

0

Hi Pierre thanks for this, trying out now but getting "ERROR 2003 (HY000): Can't connect to MySQL server on 'genome-mysql.cse.ucsc.edu' (113)"

Probably a firewall issue with my campus so I'll let you know how I get on.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by arronslacey ▴ 320

0

You should edit your question and add something like "Edit 1" at the end with your progress, rather than posting it as an answer. Or post a new question if that really doesn't work.

ADD REPLY • link 10.4 years ago by Hugues ▴ 250

0

This should be a comment, not an answer. And, yes, it is a problem with the firewall.

ADD REPLY • link 10.4 years ago by Pierre Lindenbaum 164k

3

10.4 years ago

Alex Reynolds 36k

If you pull one sample, it may not accurately reflect your background. By random chance, that one sample may not distinguish your proteins-of-interest from true background. Maybe you'll get lucky.

If you have a file containing your "universe" of proteins (e.g., all proteins except for membrane proteins, or whatever), and the FASTA headers and sequences are on alternate lines (or can be preprocessed to have that structure), then you can use a command-line program like sample to quickly extract a body of samples that more accurately define your background - say, 100 samples of 50 proteins, uniformly sampled at random without replacement:

$ for padded_idx in $(seq -f "%03g" 0 99); do \
    sample --lines-per-offset=2 --sample-size=50 allProteins.fasta \
    > sample_${padded_idx}.fasta; \
done

Then you can analyze all of sample_*.fasta for their expected characteristics.
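For anyone without the sample program installed, the same idea can be sketched in plain Python. This is only a minimal sketch under some assumptions: the whole "universe" file fits in memory, the function names (read_fasta, write_samples) are made up for illustration, and the file name allProteins.fasta is the one from the loop above. Unlike the command above, it does not need headers and sequences on alternate lines.

```python
import random


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line, []
            elif line:
                chunks.append(line)  # join wrapped sequence lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_samples(records, n_samples=100, sample_size=50,
                  prefix="sample_", seed=None):
    """Write n_samples FASTA files, each drawn without replacement."""
    rng = random.Random(seed)
    paths = []
    for idx in range(n_samples):
        # random.sample never repeats a record within one sample
        chosen = rng.sample(records, sample_size)
        path = f"{prefix}{idx:03d}.fasta"
        with open(path, "w") as out:
            for header, seq in chosen:
                out.write(f"{header}\n{seq}\n")
        paths.append(path)
    return paths
```

Calling write_samples(read_fasta("allProteins.fasta")) would produce sample_000.fasta through sample_099.fasta. Records may still repeat across different samples, since each sample is drawn independently.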

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Alex Reynolds 36k

0

Thanks Alex - I'll definitely be using this once I have my proteins.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by arronslacey ▴ 320

3

10.4 years ago

Pierre Lindenbaum 164k

Using the UCSC MySQL server for UniProt (slow!):

 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -D uniProt -e 'select concat(">",F.acc),SEQ.val,count(C.id) as TOTAL,rand() from (feature as F,protein as SEQ) left join featureClass as C on C.id=F.featureClass and C.val="TRANSMEM" where SEQ.acc=F.acc group by F.acc having TOTAL=0 order by 4 limit 10' | cut -f 1,2 | tr "\t" "\n" | fold -w 60

>A1INK5
RYADYPEVHTTWNVISTIGSTISFLGILYFFYIIWESLITQRMVLFSIQLCSSIEWLQNS
PPAEHSYSELPLLTNF
>K1SNC9
PYAVTKADFISERKHNSLPVMPNAYGHTEYDENGNTYLCFDKGESDDFVHSYAVFYSDGT
RYDYFSDFYKGISSMADEVKLPVYSKSPGVYNIKVYAIDSYGSISDSYTSIDRSEVRRRK
TYRRKLAPEIKY
>Q2YN00
MNNGTSPAGGETEATQTRSGFVALIGAPNAGKSTLVNQLVGTKVSIVTHKVQTTRALVRG
IFIEGPAQIVLVDTPGIFRPKRRLDRAMVTTAWGGAKDADIILVIIDAQGGFNENAEALL
ESMKDIRQKKVLVLNKVDRVDPPVLLSLAQKANELVPFDRTFMISALNGSGCKDLAKYLA
ESVPNGPWYYPEDQISDMPMRQLAAEITREKLYLRLHEELPYASTVETERWEERKDGSVR
IEQVIYVERESQKKIVLGHKGETVKAIGQAARKEISEILEQTVHLFLFVKVRENWGNDPE
RYREMGLDFPT
>J2I554
MSESDTVLSAFAGGVLTLALNRPDKLNAFNEEMHLALRAGFERAHQDASVRAVLLTGAGR
AFCAGQDLGDRDPRNGGAAPDLGRTIELFYNPLIRLIRTLEKPVICAVNGVAAGAGANIA
LACDITLAARSARFIQAFAKIGLVPDSGGTWSLPRLLGEARAKALALTAEPLDAETAASW
GLIWKAVDDAELLDEANTLATRLAAGPTKGLGMTKRAIQAAATNSLDEQLELERDLQREA
GRSADYAEGVLAFLEKRKPEFKGQ
>F1BDQ6
MEEIQRYLQLERSEQHDFLYPLIFQEYIYAFAHDRGFNRSILSENPGYDNKSSLLIVKRL
ISRMYQQNHFLISPNDWNQNPFWVRNRNFYSQIISEGFAFIVEIPFSRRLISCLEEKDSQ
ISEFTINSFNISLFRGQFFTSKFSIRYTNTPPCPWGNLGSNSSLXXXNEYCNCNSLITPT
KASSSFLKRNQRLFLFLYNSHVSEYESIFVFLRNQSSHLRSTSSGVLLERIYFYRKIKRL
VNVFLKVKDFQANLCLGNEPCMHSIRYQRKSSLASKGTSLSMNKWKCYLVTFWQWHFSLW
FHPRRIYINQLSNHSLDFLGYHSSVRMNSSMVRSQILENSFRINNAIKKFDTLLPIIPMI
SSLAKAKFCNVLGHPISKPVRADLSDSN
>G3MEH5
EDGINQVQSSVAEYPEAITYLLEQYDKYEAEQLRLSDIISGFIDPNETDDVAPTATHIGS
ELSEEDLADEDEDEDEDEDGDGDDSDDDGDGGPDPEVAREKFGELRAQYEVTRLSIQQNG
RAHEDTQNAIAQLADVFRQFRLMPKQFDRLVNNMREMMERVRVQERIIMKLCVEQAKMPK
KTFVAAFTNNECETAWFEYQKQAGKAWSPRLVEMDEDVLRAIGKLQQIEEET
>T1UMN5
MFPILSQFLNSGQQTIRAARYIGQGFMITLSHANRLPVTIQYPYEKLITSDRFRGRIHFE
FDKCIACEVCVRVCPIDLPVVDWKLEINIRKKRLLNYSIDFGICIFCGNCVEYCPTNCLS
MTEEYELSTYDRHELNYNQIALGRLPVSIIDDYTIRTISSNSPQIKNV
>S3GGS4
MKHVLSIQSHVVYGYAGNKSATFPMQLLGVDVWALNTVQFSNHTQYGKWTGMVIPKEQIG
EIVRGIDAIEALHLCDAIVSGYIGSAEQVEEIVNAVRFIKSKNPNALYLCDPVMGHPDKG
CIVAEGVKEGLINLAMAEADLITPNLVELRELSGLPVENFAQAQDAVRAILAKGPKKVLV
KHLSKVGKDSSQFEMLLATKDGMWHISRPLHQFRKEPVGVGDLTAGLFIANLLNGKSDIE
AFEHTANAVNDVMTVTQQKDNYELQIIAAREYIMQPSSQYKAVKIA
>I2I5C1
MARIIVVTSGKGGVGKTTSSAAIATGLAQKGKKTVVIDFDIGLRNLDLIMGCERRVVYDF
VNVIQGDATLNQALIKDKRTENLYILPASQTRDKDALTREGVAKVLDDLKAMDFEFIVCD
SPAGIETGALMALYFADEAIITTNPEVSSVRDSDRILGILASKSRRAENGEEPIKEHLLL
TRYNPGRVSRGDMLSMEDVLEILRIKLVGVIPEDQSVLRASNQGEPVILDINADAGKAYA
DTVERLLGEERPFRFIEEEKKGFLKRLFGG
>V1T7L7
KYYMDDITQENVMSFLTPVYLAGTLKGIVMVDVNQDNLKIFLYPGPSAGLALS

ADD COMMENT • link 10.4 years ago by Pierre Lindenbaum 164k

3

10.4 years ago

Elisabeth Gasteiger ★ 2.4k

On the UniProt web site, http://www.uniprot.org, you can add &random=yes to any query (see here).

The following query returns a random reviewed human entry:

http://www.uniprot.org/uniprot/?query=reviewed:yes+AND+organism:9606&random=yes

However, as far as using UniProtKB/Swiss-Prot to build negative data sets is concerned, please read this FAQ

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Elisabeth Gasteiger ★ 2.4k

0

10.4 years ago

Hugues ▴ 250

Here is some python code:

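from Bio import ExPASy
from Bio import SwissProt

# Fetch the raw Swiss-Prot entry for one accession and parse it
handle = ExPASy.get_sprot_raw('B5ZC00')
record = SwissProt.read(handle)
print(record.sequence)  # print() is a function in Python 3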

To improve things further, you'll need to write a file with the IDs of your proteins and feed them (or some of them) to your query. Write the output to a FASTA file (with a header) instead of printing it to the terminal.
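A rough sketch of that extension, assuming Biopython is installed and that the IDs sit one per line in a text file. The helper names (to_fasta, fetch_sequence, write_fasta) and the 60-character line width are my own choices for illustration, not part of Biopython.

```python
def to_fasta(accession, sequence, width=60):
    """Format one record as FASTA text, wrapping the sequence at `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{accession}\n" + "\n".join(lines) + "\n"


def fetch_sequence(accession):
    """Fetch one sequence with Biopython (needs network access)."""
    # Imported here so the formatting helper works without Biopython
    from Bio import ExPASy, SwissProt
    handle = ExPASy.get_sprot_raw(accession)
    try:
        return SwissProt.read(handle).sequence
    finally:
        handle.close()


def write_fasta(id_path, out_path, fetch=fetch_sequence):
    """Read one accession per line from id_path and write FASTA to out_path."""
    with open(id_path) as ids, open(out_path, "w") as out:
        for line in ids:
            accession = line.strip()
            if accession:  # skip blank lines
                out.write(to_fasta(accession, fetch(accession)))
```

With a hypothetical ids.txt, write_fasta("ids.txt", "proteins.fasta") would fetch each entry in turn. The fetch argument is there so a different lookup (or a local cache) can be swapped in without touching the rest.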

ADD COMMENT • link 10.4 years ago by Hugues ▴ 250

0

10.4 years ago

5heikki 11k

Generate two lists of random numbers with e.g. $RANDOM, multiply them, treat the results as GI numbers, and pull them from nr with blastdbcmd.
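The number-generating half of this can be sketched in Python as well. This is only an illustration: the function name is made up, and the upper bound of 32767 is an assumption chosen to mirror the range of bash's $RANDOM. Note that a product of two random numbers is not uniformly distributed (a prime larger than 32767 can never come out, for instance), and nothing here checks that a candidate corresponds to a real entry.

```python
import random


def random_gi_candidates(n, seed=None, max_factor=32767):
    """Return n distinct candidate numbers, each a product of two random ints."""
    rng = random.Random(seed)
    candidates = set()
    while len(candidates) < n:
        # Mirrors "multiply two $RANDOM values"; the result is NOT uniform
        product = rng.randint(1, max_factor) * rng.randint(1, max_factor)
        candidates.add(product)
    return sorted(candidates)


if __name__ == "__main__":
    # One candidate per line, ready to be saved as a batch file of IDs
    for gi in random_gi_candidates(100):
        print(gi)
```

The printed list could be saved to a file and handed to blastdbcmd, for example via its -entry_batch option. Whether a given number resolves to a sequence depends on the database, so expect misses.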

ADD COMMENT • link 10.4 years ago by 5heikki 11k

0

Are all possible GI numbers attributed?

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Hugues ▴ 250

0

How would you exclude membrane proteins? No, I think that a list is the way to go.

ADD REPLY • link 10.4 years ago by Hugues ▴ 250

0

It could be, though, that choosing randomly and then filtering out membrane proteins might work. That would be difficult, though, as it would need a list of all known membrane proteins.

ADD REPLY • link 10.4 years ago by arronslacey ▴ 320

Ram · Accepted Answer · 2014-07-10

5

10.4 years ago

Hugues ▴ 250

Here are all the ~20k reviewed Homo sapiens proteins in the UniProt database:

http://www.uniprot.org/uniprot/?query=%28taxonomy%3A9606%29+AND+reviewed%3Ayes

Play around, exclude some, include non-reviewed or change organism, then click download.

Choose which format suits you, e.g. FASTA.

Here is the first one (just for fun):

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

Then load some at random.
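One way to do that last step without holding the whole download in memory is reservoir sampling, which picks k records uniformly in a single pass. The sketch below is only that, a sketch: the function names are invented, the accession parsing assumes headers shaped like the >sp|P31946|1433B_HUMAN example above, and the exclude set stands in for whatever list of unwanted accessions (say, known membrane proteins) you manage to compile.

```python
import random


def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header):
    """Extract the accession from a header like '>sp|P31946|1433B_HUMAN ...'."""
    parts = header.split("|")
    if len(parts) >= 3:
        return parts[1]
    fields = header[1:].split()  # fallback: first word after '>'
    return fields[0] if fields else ""


def reservoir_sample(records, k, exclude=frozenset(), seed=None):
    """Pick up to k records uniformly in one pass, skipping excluded accessions."""
    rng = random.Random(seed)
    reservoir, seen = [], 0
    for header, seq in records:
        if accession_of(header) in exclude:
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append((header, seq))
        else:
            j = rng.randrange(seen)  # uniform in [0, seen)
            if j < k:
                reservoir[j] = (header, seq)
    return reservoir
```

Usage would look something like opening the downloaded file and calling reservoir_sample(iter_fasta(handle), 50, exclude=membrane_accessions), where membrane_accessions is a set you supply.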

ADD COMMENT • link 10.4 years ago by Hugues ▴ 250

1

Thanks Hugues - I think I will end up combining your query with Pierre's approach of using a MySQL interface. Alex's suggestion about using "sample" is interesting too.

ADD REPLY • link 10.4 years ago by arronslacey ▴ 320

1

I also like that I am able to add family NOT XXX.

This is really helpful when I want to pick out proteins that aren't in a certain family.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by arronslacey ▴ 320