[ensembl-dev] Building on Ensembl

Will McLaren wm2 at ebi.ac.uk
Wed Mar 9 17:06:08 GMT 2011


Hi Reece,

Just to jump in on this from a variation team perspective:

On 9 March 2011 16:37, Reece Hart <reece at harts.net> wrote:
> Hi Glenn-
> Thanks for your replies.
> On Wed, Mar 9, 2011 at 1:28 AM, Glenn Proctor <glenn at ebi.ac.uk> wrote:
>>
>> I'm a little unsure about this, some of our tables have a *lot* of
>> rows, and the overhead of computing and comparing hashes will be
>> non-trivial, plus storing the data in hashed form would, presumably,
>> make the database's intrinsic optimisations less useful, and maybe
>> even prevent indexes from working properly (or at all), which would
>> cause things to grind to a halt very quickly.
>
> Hashes would be computed only twice: once just after installation, and once
> just before migration. Furthermore, they'd be stored elsewhere. Think
> tripwire (the old security tool) for Ensembl.
> Let's use the variation table as an example. Approach 1 would create a new
> schema and table, say hs6137fsha1.variation, to store <variation_id,sha1>
> immediately after installation. Time would pass and we'd make changes. At
> upgrade time, and only at upgrade time, we'd identify:
>
> new rows (keys in homo_sapiens_variation_61_37f.variation and not
> in hs6137fsha1.variation)
> deleted rows (in hs6137fsha1.variation, not
> in homo_sapiens_variation_61_37f.variation)
> changed rows (keys in both, sha1 differs)
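
[For concreteness, the three-way diff proposed above could be sketched
roughly as below. This is only an illustration of the tripwire idea,
using SQLite and invented column names (variation_id, name), not the
real Ensembl variation schema; the snapshot runs once after
installation and the diff once before migration, as described.]

```python
import hashlib
import sqlite3

# Hypothetical miniature schema standing in for the real tables:
# "variation" plays homo_sapiens_variation_61_37f.variation,
# "variation_sha1" plays the hs6137fsha1.variation sidecar.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variation (variation_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE variation_sha1 (variation_id INTEGER PRIMARY KEY, sha1 TEXT);
""")

def row_sha1(row):
    """Hash all columns of a row into a single SHA-1 digest."""
    return hashlib.sha1("|".join(map(str, row)).encode()).hexdigest()

def snapshot(conn):
    """Run once, just after installation: store <key, sha1> per row."""
    conn.execute("DELETE FROM variation_sha1")
    for row in conn.execute("SELECT variation_id, name FROM variation"):
        conn.execute("INSERT INTO variation_sha1 VALUES (?, ?)",
                     (row[0], row_sha1(row)))

def diff(conn):
    """Run once, just before migration: find new/deleted/changed rows."""
    current = {row[0]: row_sha1(row)
               for row in conn.execute("SELECT variation_id, name FROM variation")}
    stored = dict(conn.execute("SELECT variation_id, sha1 FROM variation_sha1"))
    new = set(current) - set(stored)          # keys only in the live table
    deleted = set(stored) - set(current)      # keys only in the sidecar
    changed = {k for k in current.keys() & stored.keys()
               if current[k] != stored[k]}    # keys in both, sha1 differs
    return new, deleted, changed

# Usage: install, snapshot, mutate over a release cycle, diff at upgrade.
conn.executemany("INSERT INTO variation VALUES (?, ?)",
                 [(1, "rs1"), (2, "rs2"), (3, "rs3")])
snapshot(conn)
conn.execute("INSERT INTO variation VALUES (4, 'rs_novel')")              # new
conn.execute("DELETE FROM variation WHERE variation_id = 3")              # deleted
conn.execute("UPDATE variation SET name = 'rs2b' WHERE variation_id = 2") # changed
new, deleted, changed = diff(conn)
print(new, deleted, changed)  # {4} {3} {2}
```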

In many cases this will be true; however, when we build a new Ensembl
Variation database from a new dbSNP release, there will be basically
no rows in common between the new and previous DBs. This is because
the internal keys used will differ as the database is built from
scratch, not by building on top of the previous release.

In these instances, I would imagine your approach would become
extremely inefficient.

For releases where we do not have a fresh build of dbSNP, your
approach perhaps has some merit, as we will generally just be adding
data on top of the previous DB.

Have you factored in the size of our tables? In human, for example,
our variation database is 76 GB in total, with many tables containing
hundreds of millions of rows.

Anyway, sounds like an interesting project, and good luck!

Cheers

Will McLaren
Ensembl Variation


>
> The only computational burden would be after installation and during
> migration.
> The most common change is likely to be insert; in fact, I have no existing
> use case for update or delete, but merely point out that this approach would
> allow identification of such rows.
> The context for all of this is that we will need to store novel variation
> and associated data. The Ensembl structure should work well and allow us to
> use existing tools. The only rub is how to transfer in-house data between
> releases. Perhaps this context will trigger new ideas.
> Thanks for your time.
> -Reece
>
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev
>
>



