Question

Space-Aware Storing Protein Sequence In Mysql

4

13.6 years ago

Leszek 4.2k

I need to store ~7 million unique amino acid sequences in MySQL. Until now, I have been storing each sequence in a TEXT column. Is there another way of encoding protein sequences in MySQL so that they take less space?

EDIT
The protein table in question backs a web server. One of the web server's features is BLAST search, so all proteins are going to be compiled into a BLAST database anyway.
Do you think querying the BLAST database with fastacmd would be more reasonable than storing all the data in MySQL? I haven't tried that so far, but fastacmd is quite fast.
The web server runs under Apache with jQuery and mod_python.

mysql protein amino-acids • 4.3k views

ADD COMMENT • link updated 13.6 years ago by Pierre Lindenbaum 166k • written 13.6 years ago by Leszek 4.2k

0

If all you want to do is retrieve a sequence by name, fastacmd is fine. At the same time, I do not see a particular problem with storing 7 million sequences in MySQL as long as you do not try to index the sequences. Nonetheless, this may not be the best strategy.

ADD REPLY • link 13.6 years ago by lh3 33k

score 4 · Answer 1 · 2011-10-04

4

13.6 years ago

brentp 24k

Don't do that. Have your table in MySQL be something like:

proteins
   - protein_name varchar
   - fpos         uint
   - length       uint

and then store your proteins in a text file. The fpos is the offset in that file of the start of the protein sequence and the length is (you guessed it) the length of the protein. From there, you can use any language to fseek to fpos and read out length characters when given a protein_name.
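
A minimal Python sketch of the offset lookup described above. It assumes a FASTA file in which every sequence sits on a single line after its header, and it uses a plain dict as a stand-in for the MySQL table of (protein_name, fpos, length); the function names and the demo records are made up for illustration.

```python
import os
import tempfile


def build_index(fasta_path):
    """Return {protein_name: (fpos, length)} for a single-line-per-sequence FASTA file."""
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            name = header[1:].strip().decode("ascii")  # drop the leading '>'
            fpos = handle.tell()  # byte offset of the first residue
            seq_line = handle.readline()
            index[name] = (fpos, len(seq_line.rstrip(b"\r\n")))
    return index


def fetch(fasta_path, index, name):
    """Seek to fpos and read exactly `length` bytes."""
    fpos, length = index[name]
    with open(fasta_path, "rb") as handle:
        handle.seek(fpos)
        return handle.read(length).decode("ascii")


# Demo with two made-up records written to a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "proteins.fa")
with open(demo_path, "wb") as out:
    out.write(b">prot1\nMVDPCAPLLQL\n>prot2\nMVLLVCMVDPLAC\n")

index = build_index(demo_path)
print(fetch(demo_path, index, "prot2"))
```

In a real setup the (fpos, length) pairs would be inserted into the proteins table once, and each request would do one indexed lookup by protein_name followed by one seek and read. Sequences wrapped over several lines would need the length to account for the embedded newlines.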

ADD COMMENT • link 13.6 years ago by brentp 24k

2

This is all IMHO...
1) Relational databases are not meant to store long character strings.
2) In order to do anything with your data, you'll have to pull it out of the database anyway, since all tools expect a flat-file format.
3) Speaking of which, all tools for working with sequences expect a flat-file format (e.g. FASTA).
4) If you need to compress your data (as in this question), you can gzip it and it will still work with many tools.

ADD REPLY • link 13.6 years ago by brentp 24k

0

Out of curiosity: Can you please elaborate why this approach is better than storing the sequence directly in the db?

ADD REPLY • link 13.6 years ago by Chris ★ 1.6k

0

When you compress, you cannot achieve random access at the same time, at least not easily.

ADD REPLY • link 13.6 years ago by lh3 33k

0

@lh3 : tell that to the author of tabix ;)
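
One way to get both compression and random access, loosely in the spirit of the block-compressed BGZF files that tabix indexes, is to compress records independently and remember where each compressed chunk starts. The sketch below is only an illustration of that idea, not the BGZF format itself: it compresses every sequence as its own zlib stream, and the function names and demo records are invented.

```python
import os
import tempfile
import zlib


def write_compressed(records, path):
    """Write each sequence as its own zlib stream; return {name: (fpos, length)}."""
    index = {}
    with open(path, "wb") as out:
        for name, seq in records:
            blob = zlib.compress(seq.encode("ascii"))
            index[name] = (out.tell(), len(blob))  # offset and size of the compressed chunk
            out.write(blob)
    return index


def fetch_compressed(path, index, name):
    """Seek to one chunk and decompress only that chunk."""
    fpos, length = index[name]
    with open(path, "rb") as handle:
        handle.seek(fpos)
        return zlib.decompress(handle.read(length)).decode("ascii")


# Demo with made-up records.
records = [("prot1", "MVDPCAPLLQL"), ("prot2", "MVLLVCMVDPLAC")]
blob_path = os.path.join(tempfile.mkdtemp(), "proteins.z")
blob_index = write_compressed(records, blob_path)
print(fetch_compressed(blob_path, blob_index, "prot2"))
```

Compressing each short record on its own gives a poor ratio, so a practical variant would probably group many records into one block and index the block plus the position inside it.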

ADD REPLY • link 13.6 years ago by brentp 24k

Ram · Answer 2 · 2011-10-04

MySQL 5.5 provides functions to compress and uncompress strings: COMPRESS() and UNCOMPRESS().

The compressed values have to be stored in a BLOB column.
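
Before changing the schema it may be worth estimating how much compression would actually save. The Python sketch below assumes, based on my reading of the MySQL documentation, that a compressed value is a 4-byte length prefix followed by a zlib stream; the helper names are hypothetical and the numbers are only an approximation of what the server would store.

```python
import zlib


def estimated_compress_size(sequence):
    """Approximate stored size in bytes of a compressed sequence.

    Assumption: 4-byte uncompressed-length prefix + zlib-compressed payload.
    """
    return 4 + len(zlib.compress(sequence.encode("ascii")))


def compression_ratio(sequence):
    """Estimated compressed size divided by the raw size (below 1.0 means space saved)."""
    return estimated_compress_size(sequence) / len(sequence)


# A made-up short peptide and a highly repetitive made-up sequence.
for seq in ("MVDPCAPLLQL", "MVDPCAPLLQL" * 100):
    print(len(seq), estimated_compress_size(seq), round(compression_ratio(seq), 2))
```

Very short sequences can end up larger than the original because of the fixed overhead, so running this over a sample of the real 7 million sequences would give a more honest picture than any single example.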

UCSC stores its sequences in a BLOB:

mysql> desc knownGenePep;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| name  | varchar(255) | NO   | PRI |         |       |
| seq   | longblob     | NO   |     |         |       |
+-------+--------------+------+-----+---------+-------+
2 rows in set (0.23 sec)

And, as Brent said, sometimes the tables contain only the path to the sequence files:

mysql> select * from gbExtFile limit 2 \G
*************************** 1. row ***************************
  id: 1
path: /gbdb/genbank/./data/processed/genbank.182.0/full/mrna.fa
size: 3919595538
*************************** 2. row ***************************
  id: 2
path: /gbdb/genbank/./data/processed/genbank.182.0/daily.0216/mrna.fa
size: 791452
2 rows in set (0.22 sec)

And, as a personal choice, I would use a key/value engine (NoSQL) to store this kind of data (I mean, if the only goal of your database is to store those name/sequence pairs).
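
As a rough illustration of the key/value idea, here is a sketch using Python's standard-library dbm module as a stand-in for whichever key/value engine you would actually pick. The function names and demo records are invented, and a dedicated store would likely behave differently at the scale of 7 million entries.

```python
import dbm
import os
import tempfile


def store_sequences(db_path, records):
    """Write (name, sequence) pairs into a dbm file, keyed by name."""
    with dbm.open(db_path, "c") as db:  # "c": create the database if needed
        for name, seq in records:
            db[name.encode("ascii")] = seq.encode("ascii")


def lookup(db_path, name):
    """Return the sequence stored under `name`."""
    with dbm.open(db_path, "r") as db:
        return db[name.encode("ascii")].decode("ascii")


# Demo with made-up records.
kv_path = os.path.join(tempfile.mkdtemp(), "proteins")
store_sequences(kv_path, [("prot1", "MVDPCAPLLQL"), ("prot2", "MVLLVCMVDPLAC")])
print(lookup(kv_path, "prot1"))
```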

Edit: see also SO: http://stackoverflow.com/search?q=database+storing+large+text

score 0 · Answer 3 · 2011-10-04

You could search your 7 million protein sequences for repeats, replace each repeat with a non-amino-acid code (a number or some other symbol), and save the repeat-to-code mapping. When retrieving a sequence from the database, you can then use this mapping to convert it back to the original protein sequence.

Example:

seq1: MVDPCAPLLQL
seq2: MVLLVCMVDPLAC
seq3: LLQLMVDPCLC

Repeat1: MVDP Code: 1
Repeat2: LLQL Code: 2
Repeat3: MV   Code: 3

Shortened seq1: 1CAP2
Shortened seq2: 3LLVC1LAC
Shortened seq3: 21CLC

It might take some time to figure out the best way to choose the repeat length, whether to preset it or determine it dynamically, and which algorithm works best, but it can save a lot of space.
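
A small Python sketch of the substitution scheme, assuming the repeat table has already been chosen (finding good repeats is the hard part and is not shown). It tries the longest repeat first at each position, and it relies on the codes being digits so they cannot be confused with amino acid letters; the function names are made up.

```python
def encode(sequence, codes):
    """Replace known repeats with their codes, trying the longest repeat first."""
    repeats = sorted(codes, key=len, reverse=True)
    out = []
    pos = 0
    while pos < len(sequence):
        for repeat in repeats:
            if sequence.startswith(repeat, pos):
                out.append(codes[repeat])
                pos += len(repeat)
                break
        else:
            # No repeat starts here: keep the residue as it is.
            out.append(sequence[pos])
            pos += 1
    return "".join(out)


def decode(encoded, codes):
    """Expand every code character back into its repeat."""
    lookup = {code: repeat for repeat, code in codes.items()}
    return "".join(lookup.get(ch, ch) for ch in encoded)


CODES = {"MVDP": "1", "LLQL": "2", "MV": "3"}

short1 = encode("MVDPCAPLLQL", CODES)
short2 = encode("MVLLVCMVDPLAC", CODES)
print(short1, short2)
print(decode(short1, CODES), decode(short2, CODES))
```

Single-character codes cap the table at a handful of repeats; a larger table would need multi-character codes and a delimiter, which eats into the savings.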

Cheers, Niek