[ensembl-dev] Building on Ensembl

Fri Mar 11 16:58:06 GMT 2011

On Wed, Mar 9, 2011 at 9:06 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:

> In many cases this will be true; however, when we build a new Ensembl
> Variation database from a new dbSNP release, there will be basically
> no rows in common between the new and previous DBs.
>

I wasn't sufficiently clear. The proposal is to hash rows in release 61, let
insert/update/delete changes happen during the course of our use, then, when
62 is released, repeat the hash process *in the same database*. The point is
to identify changes that were made to the database instance since
installation. Propagating those changes would depend on how much the schema
changed, but would generally be done semi-automatically.
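
A minimal sketch of the comparison step, assuming each hash pass is kept as a map from primary key to row hash. The function names (row_sha1, snapshot, diff_snapshots) and the sample rows are hypothetical and only illustrate the idea; they are not part of Ensembl.

```python
import hashlib


def row_sha1(values):
    """Hash one row's non-key columns; None is rendered as the literal 'NULL'."""
    text = "".join("NULL" if v is None else str(v) for v in values)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def snapshot(rows):
    """Map primary key -> row hash. Each row is a (key, col1, col2, ...) tuple."""
    return {row[0]: row_sha1(row[1:]) for row in rows}


def diff_snapshots(before, after):
    """Return (inserted, deleted, updated) primary keys between two snapshots."""
    inserted = sorted(after.keys() - before.keys())
    deleted = sorted(before.keys() - after.keys())
    updated = sorted(k for k in before.keys() & after.keys()
                     if before[k] != after[k])
    return inserted, deleted, updated


# Hypothetical data: hashes taken at installation, then again before upgrading.
at_install = snapshot([(1, 1, "rs1", None), (2, 1, "rs2", "A"), (3, 1, "rs3", "G")])
before_upgrade = snapshot([(1, 1, "rs1", None), (2, 1, "rs2", "T"), (4, 2, "local1", None)])

print(diff_snapshots(at_install, before_upgrade))  # ([4], [3], [2])
```

Only the keys in the three lists would need to be examined when carrying local changes forward; unchanged rows drop out of the comparison entirely.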

> Have you factored in the size of our tables? In human, for example,
> our variation database is 76gb in total, with many tables containing
> hundreds of millions of rows.
>

I just ran a test that was even faster than I expected. Here's a sample:

mysql> create table reece.variation_sha1 as
    -> select variation_id,
    ->        sha1(concat(coalesce(source_id,'NULL'),
    ->                    coalesce(name,'NULL'),
    ->                    coalesce(validation_status,'NULL'),
    ->                    coalesce(ancestral_allele,'NULL'),
    ->                    coalesce(flipped,'NULL'),
    ->                    coalesce(class_so_id,'NULL'))) as sha1
    -> from variation;
Query OK, 30443264 rows affected (2 min 9.36 sec)
Records: 30443264  Duplicates: 0  Warnings: 0

So, that's ~2 minutes to checksum 30M rows (roughly 235,000 rows per
second). That completely allays my concern about timing. If you remain
concerned, I'd love to know what I'm not seeing. (BTW, this is on an EC2
m1.large instance with the MySQL data directory on an EBS volume.)
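
One caveat about the hash expression itself, sketched here in Python rather than SQL. Plain concatenation without a separator means that different column values can produce the same input string, and a real NULL is indistinguishable from the text 'NULL'. The length-prefixed variant below is a hypothetical alternative, not something that was run against the database.

```python
import hashlib


def sha1_concat(values):
    """Mirror the query: NULL becomes 'NULL', then columns are concatenated."""
    text = "".join("NULL" if v is None else str(v) for v in values)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def sha1_length_prefixed(values):
    """Hypothetical variant: tag each column and prefix it with its length."""
    parts = []
    for v in values:
        if v is None:
            parts.append("N;")  # marker reserved for NULL
        else:
            s = str(v)
            parts.append(f"V{len(s)}:{s};")  # value with explicit length
    return hashlib.sha1("".join(parts).encode("utf-8")).hexdigest()


# Column boundaries are lost with plain concatenation...
print(sha1_concat(("ab", "c")) == sha1_concat(("a", "bc")))        # True
# ...and NULL collides with the literal string 'NULL'.
print(sha1_concat((None,)) == sha1_concat(("NULL",)))              # True

# The length-prefixed encoding keeps both cases distinct.
print(sha1_length_prefixed(("ab", "c")) == sha1_length_prefixed(("a", "bc")))  # False
print(sha1_length_prefixed((None,)) == sha1_length_prefixed(("NULL",)))        # False
```

For the variation table these collisions are probably rare in practice, but an unambiguous encoding costs little and removes the doubt.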

Thanks,
Reece