[ensembl-dev] Building on Ensembl
Reece Hart
reece at harts.net
Wed Mar 9 16:37:52 GMT 2011
Hi Glenn-
Thanks for your replies.
On Wed, Mar 9, 2011 at 1:28 AM, Glenn Proctor <glenn at ebi.ac.uk> wrote:
> I'm a little unsure about this, some of our tables have a *lot* of
> rows, and the overhead of computing and comparing hashes will be
> non-trivial, plus storing the data in hashed form would, presumably,
> make the database's intrinsic optimisations less useful, and maybe
> even prevent indexes from working properly (or at all), which would
> cause things to grind to a halt very quickly.
Hashes would be computed only twice: once just after installation, and once
just before migration. Furthermore, they'd be stored elsewhere, not in the
Ensembl tables themselves. Think Tripwire (the old security tool) for Ensembl.
Let's use the variation table as an example. Approach 1 would create a new
schema and table, say hs6137fsha1.variation, to store <variation_id,sha1>
immediately after installation. Time would pass and we'd make changes. At
upgrade time, and only at upgrade time, we'd identify:
- new rows (keys in homo_sapiens_variation_61_37f.variation but not in
  hs6137fsha1.variation)
- deleted rows (keys in hs6137fsha1.variation but not in
  homo_sapiens_variation_61_37f.variation)
- changed rows (keys in both, but sha1 differs)
The only computational burden would be after installation and during
migration.
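
To make the bookkeeping concrete, here is a minimal sketch of the comparison
in Python. It is only an illustration under stated assumptions: the function
names (row_sha1, snapshot, diff_snapshots), the way a row is serialized before
hashing, and the toy two-column rows are all hypothetical, and a real version
would read from the database and would probably do the comparison in SQL.

```python
import hashlib


def row_sha1(row):
    """SHA-1 hex digest of one row (hypothetical serialization).

    repr() keeps None distinct from the string 'None', and the unit
    separator keeps adjacent column values from running together.
    """
    joined = "\x1f".join(repr(value) for value in row)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def snapshot(rows):
    """Map primary key (assumed to be the first column) to the row's sha1."""
    return {row[0]: row_sha1(row) for row in rows}


def diff_snapshots(baseline, current):
    """Return (new, deleted, changed) key lists between two snapshots."""
    new = sorted(current.keys() - baseline.keys())
    deleted = sorted(baseline.keys() - current.keys())
    changed = sorted(
        key
        for key in baseline.keys() & current.keys()
        if baseline[key] != current[key]
    )
    return new, deleted, changed


# Toy stand-in for the variation table: (variation_id, name).
installed = [(1, "rs1"), (2, "rs2"), (3, "rs3")]
baseline = snapshot(installed)  # taken once, just after installation

# Time passes: row 2 is edited, row 3 is deleted, row 4 is inserted.
at_upgrade = [(1, "rs1"), (2, "rs2-edited"), (4, "inhouse_1")]
new, deleted, changed = diff_snapshots(baseline, snapshot(at_upgrade))

print("new:", new)
print("deleted:", deleted)
print("changed:", changed)
```

Only the <key, sha1> pairs are kept between the two runs, so the live tables
and their indexes are never touched by the snapshot itself.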
The most common change is likely to be an insert; in fact, I have no existing
use case for update or delete, and merely point out that this approach would
also identify such rows.
The context for all of this is that we will need to store novel variation
and associated data. The Ensembl structure should work well and allow us to
use existing tools. The only rub is how to transfer in-house data between
releases. Perhaps this context will trigger new ideas.
Thanks for your time.
-Reece