[ensembl-dev] cf. maatkit for table checksums, Re: Building on Ensembl

Matthew Astley mca at sanger.ac.uk
Mon Mar 14 09:49:24 GMT 2011


On Fri, Mar 11, 2011 at 08:58:06AM -0800, Reece Hart wrote:

> [...] The proposal is to hash rows in release 61, let
> insert/update/delete changes happen during the course of our use,
> then, when 62 is released, repeat the hash process *in the same
> database*. The point is to identify changes that were made to the
> database instance since installation. [...]

For comparison, are you aware of maatkit?

  http://code.google.com/p/maatkit/
  "Maatkit is a set of high-quality command-line tools for open-source
  databases, such as MySQL and PostgreSQL.  [...] The toolkit includes
  functionality such as table checksums, [...]"

  http://packages.debian.org/maatkit
  (this is where I get it)

mk-table-checksum will generate (in various ways) checksums for entire
tables, with an option to do this with "--chunksize" of approximately
N rows.

You need per-row checksums?  I think mk-table-checksum is not designed
to do this.


> mysql> create table reece.variation_sha1 as select
> variation_id,sha1(concat(coalesce(source_id,'NULL'),coalesce(name,'NULL'),coalesce(validation_status,'NULL'),coalesce(ancestral_allele,'NULL'),coalesce(flipped,'NULL'),coalesce(class_so_id,'NULL')))
> as sha1 from variation ;
> Query OK, 30443264 rows affected (2 min 9.36 sec)
> Records: 30443264  Duplicates: 0  Warnings: 0
> 
> So, that's ~2 minutes to checksum 30M rows. That completely allays
> my concern about timing.

It will depend on your hardware, but if you can avoid seeking it's
likely to be CPU bound (on one processor).

Whether you can parallelise it without slowing the I/O throughput, or
the server is smart enough to hand out independent rows to other CPUs,
is the fun part.  8-]

If you don't need the strength of sha1 there are other options
  http://www.maatkit.org/doc/mk-table-checksum.html#how_fast_is_it

and later on that manpage,

| [...]  For speed, all checksums are done in parallel (as much as
| possible) and may complete out of the order in which they were
| started.


HTH,
-- 
Matthew




More information about the Dev mailing list