[ensembl-dev] cf. maatkit for table checksums, Re: Building on Ensembl
Matthew Astley
mca at sanger.ac.uk
Mon Mar 14 09:49:24 GMT 2011
On Fri, Mar 11, 2011 at 08:58:06AM -0800, Reece Hart wrote:
> [...] The proposal is to hash rows in release 61, let
> insert/update/delete changes happen during the course of our use,
> then, when 62 is released, repeat the hash process *in the same
> database*. The point is to identify changes that were made to the
> database instance since installation. [...]
For comparison, are you aware of maatkit?
http://code.google.com/p/maatkit/
"Maatkit is a set of high-quality command-line tools for open-source
databases, such as MySQL and PostgreSQL. [...] The toolkit includes
functionality such as table checksums, [...]"
http://packages.debian.org/maatkit
(this is where I get it)
mk-table-checksum will generate (in various ways) checksums for entire
tables, with an option to do this with "--chunksize" of approximately
N rows.
You need per-row checksums? I think mk-table-checksum is not designed
to do this.
> mysql> create table reece.variation_sha1 as select
> variation_id,sha1(concat(coalesce(source_id,'NULL'),coalesce(name,'NULL'),coalesce(validation_status,'NULL'),coalesce(ancestral_allele,'NULL'),coalesce(flipped,'NULL'),coalesce(class_so_id,'NULL')))
> as sha1 from variation ;
> Query OK, 30443264 rows affected (2 min 9.36 sec)
> Records: 30443264 Duplicates: 0 Warnings: 0
>
> So, that's ~2 minutes to checksum 30M rows. That completely allays
> my concern about timing.
It will depend on your hardware, but if you can avoid seeking it's
likely to be CPU bound (on one processor).
Whether you can parallelise it without slowing the I/O throughput, or
the server is smart enough to hand out independent rows to other CPUs,
is the fun part. 8-]
If you don't need the strength of sha1 there are other options
http://www.maatkit.org/doc/mk-table-checksum.html#how_fast_is_it
and later on that manpage,
| [...] For speed, all checksums are done in parallel (as much as
| possible) and may complete out of the order in which they were
| started.
HTH,
--
Matthew
More information about the Dev
mailing list