This is an update to the answer here, describing the architecture of the UniProt website. It is an architecture that makes sense for us, but not something I would recommend to anyone else without thinking hard about it first.
UniProt website architecture in 2017
UniProt is a consortium database maintained by three partners: SIB, EBI and PIR. EBI and PIR run the servers that host the website.
The UniProt data on the public website is mostly read-only, due to the four-weekly release cycle. Only job data (BLAST and the like) is read-write.
All partners use Linux (CentOS or Red Hat 7) as the server OS. EBI runs a proprietary load-balancing solution, while PIR uses Apache 2+ with mod_proxy. Behind this front end we run a total of 10 Tomcat servers on the latest security release of Java: 8 at EBI (4 per datacentre; per EBI policy, 4 are in hot standby mode and do not serve traffic on normal days) and 2 at PIR. We use DNS round robin to survive the loss of a datacentre, with load balancing behind it (you need both).
PIR uses physical machines with direct-attached local hard disks, while EBI uses virtual machines (VMware) with either Tintri-attached disks for the data or their Isilon system for temp files etc.
Search is provided by Lucene 6.6, and soon the 7 series. Using Lucene directly enables a few features that stock Lucene/Solr does not provide, such as cross-dataset queries.
Most data is no longer stored in BDB/JE but in a custom datastore: a classic key/offset map, stored either in Lucene or in memory depending on the data subset. The offset points into a large memory-mapped file, at an LZ4- or Zstd-compressed binary representation of a data record: LZ4 for datasets like taxonomy, Zstd for UniParc, UniRef etc. Sequences are just data and are not stored separately from the rest of a record in our system. The key to this kind of architecture is making sure we do no more than 5 disk seeks per UniProtKB entry page (1 for the entry, up to 3 for similar proteins and 1 for taxonomy).
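To make the idea concrete, here is a minimal sketch of such a store (not our actual code; the class names, the lz4-java dependency and the single-region mapping are assumptions, the real thing also deals with files larger than 2 GB and with keys kept in Lucene rather than a HashMap):

```java
// Sketch of a key -> offset datastore: record bytes are appended to one large file,
// an in-memory map (or a Lucene stored field) holds, per accession, the offset and
// the compressed/uncompressed lengths of its record. The file is memory mapped, so
// a cold lookup costs roughly one disk seek. Decompression here uses the lz4-java
// library; any block codec (e.g. Zstd via zstd-jni) would work the same way.
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class RecordStore {

    // value = {offset in the data file, compressed length, uncompressed length}
    private final Map<String, long[]> index = new HashMap<>();
    private final MappedByteBuffer data;
    private final LZ4FastDecompressor lz4 = LZ4Factory.fastestInstance().fastDecompressor();

    public RecordStore(Path dataFile) throws IOException {
        try (FileChannel ch = FileChannel.open(dataFile, StandardOpenOption.READ)) {
            // Files over 2 GB need several mapped regions; one region keeps the sketch short.
            data = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public void put(String accession, long offset, int compressedLen, int rawLen) {
        index.put(accession, new long[] {offset, compressedLen, rawLen});
    }

    /** Returns the decompressed record bytes, or null if the accession is unknown. */
    public byte[] get(String accession) {
        long[] entry = index.get(accession);
        if (entry == null) return null;
        byte[] compressed = new byte[(int) entry[1]];
        byte[] raw = new byte[(int) entry[2]];
        // duplicate() so concurrent readers do not fight over the buffer position
        ByteBuffer view = data.duplicate();
        view.position((int) entry[0]);
        view.get(compressed);
        lz4.decompress(compressed, 0, raw, 0, raw.length);
        return raw;
    }
}
```

A single get() touches the key/offset map (memory, or one Lucene lookup) and then one region of the mapped file, which is where the "few disk seeks per entry page" budget comes from.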
All state is injected from an XML file, using the Spring dependency injection framework, into the different Struts actions; in practice this is injection via the web application context. This is fine because our custom datastore is good for concurrent access, as is our search engine, Lucene.
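In code the wiring looks roughly like this hypothetical Struts 2 style action (class and property names invented for illustration, reusing the RecordStore sketch above); Spring sets the shared read-only store once, and every request only reads from it:

```java
// A hypothetical action: Spring injects the shared, read-only datastore via the
// setter; Struts populates the request parameter. No mutable shared state, so
// concurrent requests are safe without extra locking.
import com.opensymphony.xwork2.ActionSupport;

public class EntryAction extends ActionSupport {

    private RecordStore entryStore;   // shared singleton, injected by Spring
    private String accession;         // request parameter, set by Struts
    private byte[] record;            // result consumed by the JSP

    public void setEntryStore(RecordStore entryStore) {
        this.entryStore = entryStore;
    }

    public void setAccession(String accession) {
        this.accession = accession;
    }

    public byte[] getRecord() {
        return record;
    }

    @Override
    public String execute() {
        record = entryStore.get(accession);
        return record != null ? SUCCESS : ERROR;
    }
}
```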
High availability is achieved by each mirror having its own copy of the data. Job data such as BLAST results is shared on demand via HTTP requests between mirrors, which is acceptable because users rarely access job data older than one hour.
Storage needs for release 2017_09 are almost 307 GB for the data records and 438 GB for the Lucene indexes (version 6.6).
I am still very happy with the architecture today. It has served us with minimal issues for more than 10 years now and is still very performant. We are very happy with how Lucene has kept up with the data explosion over time. We also have a measured uptime of more than 99.9%, which is not trivial with so many datacentres and users.
For the rest, our Struts/JSP code could use an update as well, and we have an active plan for that, but the implementation is not quite decided yet.
The website as such is only suitable for finding entries; if you want to do analytical queries I strongly recommend using our SPARQL endpoint. This is an interface that exposes all the UniProt data via a standard query language and is very well suited to deep queries over our data model.
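As a rough illustration (simplified; the endpoint URL and the up: vocabulary below are indicative only, see the endpoint's own documentation and example queries for the exact terms), querying it from Java with Apache Jena looks like this:

```java
// Run a small SELECT query against a remote SPARQL endpoint with Apache Jena ARQ.
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class SparqlClient {
    public static void main(String[] args) {
        String endpoint = "https://sparql.uniprot.org/sparql"; // assumed endpoint URL
        String query =
            "PREFIX up: <http://purl.uniprot.org/core/>\n" +
            "SELECT ?protein ?mnemonic WHERE {\n" +
            "  ?protein a up:Protein ;\n" +
            "           up:mnemonic ?mnemonic .\n" +
            "} LIMIT 10";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("protein") + "\t" + row.get("mnemonic"));
            }
        }
    }
}
```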
Thank you so much! I really appreciate it. I have already spent a couple of hours reading your answers. I work for a start-up and we do not have experience with data at this scale. I have been tasked with looking into this big-data management problem, and maybe we will hire someone with more experience soon. I am thinking of using Lucene for the search because I have some experience with it, but I am really stuck on the sequence data part. I have a couple of questions:
1) I didn't quite understand: "Most data is no longer stored in BDB/JE but in a custom datastore: a classic key/offset map, stored either in Lucene or in memory depending on the data subset. The offset points into a large memory-mapped file, at an LZ4- or Zstd-compressed binary representation of a data record."
Could you dumb this down for me? The only thing I understand is that your data record contains both the sequence and the metadata (e.g. name, length, species). The record is stored in a file and compressed in LZ4 format. What is stored in Lucene? A key (e.g. the file path) that quickly finds the LZ4 file? Sorry, I am really lost with offsets, memory mapping, etc. Could you explain or point me to some learning resources?
2) By SPARQL store do you mean something like this? https://db-engines.com/en/article/RDF+Stores
Really, the only use cases I have are: 1) when a user selects one or a few sequence identifiers (after a search using Lucene), I need to quickly retrieve the sequences for further analysis (e.g. send them to our BLAST page in the front end); 2) a user can select several thousand or tens of thousands of sequence identifiers and download the sequences.
MarkLogic is out of the question - I doubt we have the $$ to fund it. Will something like Jena scale? We will have hundreds of millions of sequences. Do I even need such datastores for my use case?
Again, I really appreciate all your help
Don't follow the UniProt architecture; it is too niche and not appropriate for you at this point in time. It's a Formula 1 car, and I am not sure you are at the stage where you need a moped or can still bike ;)
Jena TDB and Virtuoso will scale to a few hundred million sequences given a reasonable hardware budget. At about a trillion you will need to start thinking hard about what you are doing. But if you use SPARQL, you will only need to fix your backend, not rewrite your frontend, when that day comes.
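As a rough sketch of that bike/moped stage (made-up predicate names, file paths and example identifiers, using the Jena TDB2 API; Virtuoso would sit behind the same SPARQL), covering your first use case of fetching sequences for a few selected identifiers:

```java
// Load sequence records (as RDF) into a local Jena TDB2 dataset, then pull back the
// sequence strings for a handful of identifiers with one SPARQL query.
import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.query.ResultSet;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb2.TDB2Factory;

public class LocalSequenceStore {
    public static void main(String[] args) {
        Dataset ds = TDB2Factory.connectDataset("/data/tdb2-sequences"); // hypothetical path

        // One-off load of a Turtle file of sequence records (hypothetical file name).
        ds.begin(ReadWrite.WRITE);
        try {
            RDFDataMgr.read(ds.getDefaultModel(), "sequences.ttl");
            ds.commit();
        } finally {
            ds.end();
        }

        // Fetch the sequences for a few selected identifiers.
        String query =
            "PREFIX ex: <http://example.org/seq/>\n" +
            "SELECT ?id ?sequence WHERE {\n" +
            "  VALUES ?id { ex:P12345 ex:Q67890 }\n" +
            "  ?id ex:aminoAcidSequence ?sequence .\n" +
            "}";
        ds.begin(ReadWrite.READ);
        try (QueryExecution qe = QueryExecutionFactory.create(query, ds)) {
            ResultSet results = qe.execSelect();
            ResultSetFormatter.out(System.out, results);
        } finally {
            ds.end();
        }
    }
}
```

Your bulk-download use case is the same query with a longer VALUES list (or paged queries), streamed out as FASTA by your own code.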
How long will you only search on sequence IDs? When will your users need to do more precise searches using the metadata, e.g. where the sequence is from, which experiment, which domains, specific profiles?
What is fast enough, and how many users? Given that a BLAST will run for about 10 seconds at that scale, how fast do you really need to pump data into that system at all? Are you cloud based or not?
Are you storing sequences or reads? Also, maybe get a consulting contract. If you are going to generate millions of your own sequences, then your lab costs will be significant and something like MarkLogic will fit in your budget (PromethION early access is 135,000 in starting costs...). Especially if you are in a clinical space and will need to deal with legal requirements.
Yes we are still in the biking stage :) Thanks for your pointers.
Just to clarify - We are not working in a clinical setting. The software will not store reads. It is supposed to store protein and gene sequences along with metadata like species, lineage, annotations, length, etc.
Again, thank you ... I will look into Jena TDB and Virtuoso ...