Reproducible research demands that we can rerun any analysis and get the same results. But how should we deal with this when analyses depend on databases that are frequently updated?
We are currently facing this problem ourselves: we have implemented a daily automatic BLAST database update, but we would like to be able to roll the database back to any given date.
I can see three ways this could be achieved:
1) In the best case this would be part of the BLAST command-line tool suite, for example by specifying with a --date flag up to which date the database should be queried. But to my knowledge, this functionality doesn't exist.
2) An easy solution would be to keep dated copies of the database. But this is obviously really resource intensive and doesn't seem like a good solution, especially if updates occur frequently.
3) So the only reasonable solution would be to have some sort of versioning of the database. This would of course mean that if you want to rerun an analysis, you would first have to restore the database to a certain roll-back point and run the analysis from there.
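To make option 3 a bit more concrete, here is a minimal sketch of what I have in mind. It is an illustration under assumptions, not a tested solution: the function names (`file_digest`, `snapshot`), the directory layout (one dated directory per snapshot) and the idea of hard-linking unchanged files are all my own, and nothing here is part of the BLAST tool suite. It also assumes that the snapshots live on a single filesystem that supports hard links, and that snapshot files are never modified after they are written.

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(live_dir, snapshot_root, date):
    """Create snapshot_root/<date> from the files in live_dir.

    Files identical to the previous snapshot are hard-linked to it,
    so only new or changed files take up additional space.
    `date` is assumed to be an ISO string such as "2021-03-01".
    """
    live_dir, snapshot_root = Path(live_dir), Path(snapshot_root)
    snapshot_root.mkdir(parents=True, exist_ok=True)

    # ISO dates sort chronologically as plain strings.
    earlier = sorted(
        p for p in snapshot_root.iterdir() if p.is_dir() and p.name < date
    )
    previous = earlier[-1] if earlier else None

    target = snapshot_root / date
    target.mkdir()  # fails if a snapshot for this date already exists
    for src in sorted(live_dir.iterdir()):
        if not src.is_file():
            continue
        dst = target / src.name
        old = previous / src.name if previous else None
        if old is not None and old.is_file() and file_digest(old) == file_digest(src):
            os.link(old, dst)       # unchanged: share storage with the previous snapshot
        else:
            shutil.copy2(src, dst)  # new or changed: store a fresh copy
    return target
```

An analysis would then be rerun by pointing the `-db` argument at the database name inside the dated directory instead of the live one. How much space this actually saves depends on how many database files change per update, which I have not measured, and hashing every file on every run would be slow for a large database.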
Do you have experience with this or any suggestions of tools which could provide a sensible solution for versioning the BLAST database? Or am I missing something and this functionality already exists?
Any input or discussion would be appreciated.
makrez : This has come up in the past (NCBI was considering keeping some archival versions available last year, this is probably on the back burner now because of SARS-CoV-2). What is the driving use case for this? GenBank is archival so you are more than likely to get a similar answer (plus new sequences that may have appeared).
Because of the size of NCBI databases (nt and nr) this is at best impractical unless your institution has deep pockets to implement a local solution either via storage or at database level.
Edit: As @leipzig points out this question may be purely about a custom internal BLAST database.
You're absolutely right, a bit of a tricky situation/issue indeed. But nonetheless happy to hear you care about reproducible science!
What DB are we talking about here btw? Something custom, or something like nr or nt or such?
When did the OP mention nt or nr?
I was writing the post with regards to the nt database.
In that case you could consider implementing the solution @Leipzig proposed below. Be aware that nt (77 GB) and nt (73 GB), compressed, are large files. They can also have multiple FASTA identifiers pointing to an identical sequence. It would be interesting to know if you can make the solution below work.
Good point. Or even databases from NCBI for that matter. I was assuming that is what they want. Apologies for that.
OP did not (yet) indeed. I alluded to it, but no reply on it (yet)