[ensembl-dev] Building on Ensembl

Glenn Proctor glenn at ebi.ac.uk
Wed Mar 9 09:28:09 GMT 2011


Hi Reece

On Wed, Mar 9, 2011 at 4:28 AM, Reece Hart <reece at harts.net> wrote:
> Greetings Ensembl devs-
> Our small San Francisco-area startup is planning to use Ensembl as a basis
> for genome variation analysis tools. I'd appreciate some advice from the
> community about how to build on top of Ensembl core and variation in a way
> that enables future updates.

This sounds like a very interesting project.

> We anticipate extending Ensembl core and variation in (at least) two
> ways: adding content of the same types as those already in Ensembl, such as
> new variation_annotation or phenotype records, and adding new types of
> content (new tables), such as internal data for genotype-phenotype
> associations. That is, we will make both DML and DDL changes to an Ensembl
> release and we'd like to make it as easy as possible to transfer these
> changes to new releases.
> Two specific questions:
> 1) Do you have any advice about "layering" our content (DML changes) to
> facilitate Ensembl updates?
> I'm open to any idea here. We considered many approaches, but the two best
> are:
>
> Approach 1: For each Ensembl table, create a companion table with the
> same name in another schema. The companion table contains a foreign key to
> its Ensembl table and a hash of the Ensembl table's row in some canonical
> format. On upgrade, we identify keys that are unique to the Ensembl table,
> unique to the checksum table, or shared; for shared keys we compare hashes.

I'm a little unsure about this: some of our tables have a *lot* of
rows, so the overhead of computing and comparing hashes will be
non-trivial. Storing the data in hashed form would also, presumably,
make the database's intrinsic optimisations less useful, and might
even prevent indexes from working properly (or at all), which would
cause things to grind to a halt very quickly.
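
Just to make the cost concrete, I'm imagining something along these
lines (the schema, table and column names below are placeholders, and
MD5 is just one possible checksum):

  -- hypothetical companion table holding one checksum per Ensembl row
  CREATE TABLE checksums.variation (
    variation_id INT UNSIGNED NOT NULL PRIMARY KEY, -- same value as the Ensembl row's PK
    row_md5      CHAR(32)     NOT NULL              -- hash of the row in a canonical format
  );

  -- populating it means reading and hashing every row of the source table
  INSERT INTO checksums.variation (variation_id, row_md5)
  SELECT variation_id,
         MD5(CONCAT_WS('#', name, source_id /* ...and every other column */))
  FROM   homo_sapiens_variation_61_37f.variation;

Re-running that load, plus the key and hash comparison, for every
table at every release is where I'd expect it to hurt.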

> Approach 2: Create a parallel schema with empty tables that will contain
> in-house data. Then, unify the Ensembl and in-house data in a third schema
> that contains UNION ALL views. For example, "CREATE VIEW merged.variation AS
> SELECT * FROM homo_sapiens_variation_61_37f.variation UNION ALL SELECT *
> FROM inhouse.variation".

This sounds better to me (although please note that we've not tried
either approach!) since it still lets the database use its indexes and
other optimisations. Presumably you'll be using a release model
similar to Ensembl's, i.e. your data gets built and is then
effectively read-only until the next release? If you're doing a lot of
writes throughout the lifetime of the release, views may not be so
appropriate.
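
For what it's worth, the upgrade path with that layout would look
something like this (schema names taken from your example; just a
sketch):

  -- one view per table you want to extend, kept in a separate "merged" schema
  CREATE VIEW merged.variation AS
    SELECT * FROM homo_sapiens_variation_61_37f.variation
    UNION ALL
    SELECT * FROM inhouse.variation;

  -- at the next release, only the view needs repointing at the new database
  CREATE OR REPLACE VIEW merged.variation AS
    SELECT * FROM homo_sapiens_variation_62_xxx.variation  -- new release's name, illustrative
    UNION ALL
    SELECT * FROM inhouse.variation;

The main constraint is that each inhouse table has to keep the same
column layout as its Ensembl counterpart, otherwise the UNION ALL
breaks.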

A third approach is to take advantage of the fact that the Registry
can deal with multiple databases, so one possibility is to have (say)
a human core database containing the public data and a second core
database containing just your internal data. We don't do this
ourselves, but it's an option worth considering.

> 2) Which primary keys, if any, are stable across Ensembl releases?
> Our DDL changes will involve only new tables (not dropped tables). Those
> tables will contain foreign keys to primary keys within Ensembl. If Ensembl
> primary keys are stable across releases, then adding tables and transferring
> them to new releases is mostly straightforward. If the primary keys are not
> stable, I need to go back to the drawing board.

We make no guarantees whatsoever about any primary keys being stable
across releases; any that happen to be stable are coincidental. We
also don't encourage the use of intra-database primary/foreign key
relationships. The whole reason we have things like gene stable IDs
(and devote a lot of effort to mapping stable IDs across releases) is
to avoid exposing primary keys.
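
So if you do add your own tables, the safer pattern is to key them on
stable IDs rather than on our internal primary keys. A rough sketch
(the inhouse table is hypothetical; ENSG00000139618 is just an example
stable ID):

  -- hypothetical in-house table keyed on a gene stable ID, not on gene.gene_id
  CREATE TABLE inhouse.gene_phenotype (
    gene_stable_id VARCHAR(128) NOT NULL,  -- e.g. 'ENSG00000139618'
    phenotype      VARCHAR(255) NOT NULL,
    PRIMARY KEY (gene_stable_id, phenotype)
  );

  -- join back to whichever core release is current via the stable ID
  SELECT g.*, p.phenotype
  FROM   inhouse.gene_phenotype p
  JOIN   homo_sapiens_core_61_37f.gene_stable_id s ON s.stable_id = p.gene_stable_id
  JOIN   homo_sapiens_core_61_37f.gene g           ON g.gene_id   = s.gene_id;

Stable IDs can still be retired or split between releases, but that is
exactly what the stable ID mapping is there to track.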

Glenn.



