Space-Aware Storing Protein Sequence In Mysql
3
4
Entering edit mode
13.2 years ago
Leszek 4.2k

I need to store ~7 million of unique amino acid sequences in MySQL. Till now, I was storing sequence as TEXT type. Is there any other way of coding protein sequence in MySQL so it will take less space?

EDIT
Mentioned table of proteins is webserver backed. One of the functionalities of webserver is blast search, so all proteins are going to be compiled into blast db, anyway.
Do you think querying blastdb by fastacmd will be more reasonable than storing all data in MySQL? I haven't tried that so far but fastacmd is quite fast.
Webserver is running under Apache with jQuery and mod-python.

mysql protein amino-acids • 4.0k views
ADD COMMENT
0
Entering edit mode

If all you want to do is to retrieve a sequence by name, fastacmd is fine. At the same time, I do not see a particular problem to store 7 million sequences in MySQL as long as you do not try to index the sequences.

ADD REPLY
0
Entering edit mode

If all you want to do is to retrieve a sequence by name, fastacmd is fine. At the same time, I do not see a particular problem to store 7 million sequences in MySQL as long as you do not try to index the sequences. Nonetheless, this may not be the best strategy.

ADD REPLY
4
Entering edit mode
13.2 years ago
brentp 24k

Don't do that. Have your table in MySQL be something like:

proteins
   - protein_name varchar
   - fpos         uint
   - length       uint

and then store your proteins in a text file. The fpos is the offset in that file of the start of the protein sequence and the length is (you guessed it) the length of the protein. From there, you can use any language to fseek to fpos and read out length characters when given a protein_name.

ADD COMMENT
2
Entering edit mode

This is all IMHO... 1) relational db's are not meant to store long character strings. 2) in order to do anything with your data, you'll have to pull it out of the database anyway since all tools expect a flat-file format ... 3) [speaking of] all tools for working with sequences expect a flat-file format. (e.g. FASTA) 4) if you need to compress your data (as in this question), you can gzip it and it will still work with many tools.

ADD REPLY
0
Entering edit mode

Out of curiosity: Can you please elaborate why this approach is better than storing the sequence directly in the db?

ADD REPLY
0
Entering edit mode

When you compress, you cannot achieve random access at the same time, at least not easily.

ADD REPLY
0
Entering edit mode

@lh3 : tell that to the author of tabix ;)

ADD REPLY
3
Entering edit mode
13.2 years ago

mysql 5.5 contains some function to compress and uncompress the strings:

the fields will have to be stored in a BLOB.

The UCSC stores its sequence in a BLOB:

mysql> desc knownGenePep;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| name  | varchar(255) | NO   | PRI |         |       |
| seq   | longblob     | NO   |     |         |       |
+-------+--------------+------+-----+---------+-------+
2 rows in set (0.23 sec)

And, as said Brent, sometimes the tables only contain the path to the sequences:

mysql> select * from gbExtFile limit 2 \G
*************************** 1. row ***************************
  id: 1
path: /gbdb/genbank/./data/processed/genbank.182.0/full/mrna.fa
size: 3919595538
*************************** 2. row ***************************
  id: 2
path: /gbdb/genbank/./data/processed/genbank.182.0/daily.0216/mrna.fa
size: 791452
2 rows in set (0.22 sec)

And , as a personal choice, I would use a key/value engine (nosql) to store this kind of data (I mean, if the only goal of your database is storing those name/sequences).

Edit: see also SO: http://stackoverflow.com/search?q=database+storing+large+text

ADD COMMENT
0
Entering edit mode
13.2 years ago
Niek De Klein ★ 2.6k

You could look at your 7 million protein sequences and search for repeats, then turn those repeats in a non-amino acid code (number or something else) and saving the key-code combination. Then when retrieving the sequence from the database you can use this key-code combination to convert back to the original protein sequence.

Example:

seq1: MVDPCAPLLQL
seq2: MVLLVCMVDPLAC
seq3: LLQLMVPDCLC

Repeat1: MVDP Code: 1
Repeat2: LLQL Code: 2
Repeat3: MV   Code: 3

Shortened seq1: 1CAP2
Shortened seq2: 3LLVC1LAC
Shortened seq3: 21CLC

Might take some time figuring out what the best way is to choose repeat length, if you want to preset it or do it dynamically, and what the best algorithm for it would be, but it can save a lot of space.

Cheers, Niek

ADD COMMENT

Login before adding your answer.

Traffic: 1265 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6