[ensembl-dev] Corrupted prediction matrix in local installation from Ensembl version 99 on

Chris Weichenberger christian.weichenberger at eurac.edu
Wed Sep 16 16:20:55 BST 2020


Dear all,

Since Ensembl release 99 we have been facing errors when using SIFT and
PolyPhen scores from the variation database on a local installation of the
human Ensembl database (GRCh38). Our program uses the Ensembl Perl API,
and when computing SIFT scores we end up with the following error message
(the example uses Ensembl API version 100):

-------------------- EXCEPTION --------------------
MSG: Failed to gunzip:
STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::expand_matrix
/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:654
STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::prediction_from_matrix
/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:699
STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::get_prediction
/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:349
STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::_protein_function_prediction
/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1227
STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::sift_score
/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1021
[removed: the rest of the stack originating from our software]

When we connect to the public Ensembl database via the Perl API, the
program works as expected and computes the scores. Debugging this problem
revealed that the binary data for the prediction matrix stored in our
database differs from what is stored in the official Ensembl database.

$ mysql -h ensembldb.ensembl.org -u anonymous

mysql> use homo_sapiens_variation_100_38
mysql> SELECT  t.translation_md5, a.value, length(p.prediction_matrix)
FROM  (protein_function_predictions p, translation_md5 t, attrib a)
WHERE t.translation_md5 = '6cfea960ade30a16e3f138d55d1eaf03' AND
      a.value = 'sift'  AND
      p.translation_md5_id = t.translation_md5_id AND
      p.analysis_attrib_id = a.attrib_id;

+----------------------------------+-------+-----------------------------+
| translation_md5                  | value | length(p.prediction_matrix) |
+----------------------------------+-------+-----------------------------+
| 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16698 |
+----------------------------------+-------+-----------------------------+

On our system, however, the size of the prediction matrix is smaller:

+----------------------------------+-------+-----------------------------+
| translation_md5                  | value | length(p.prediction_matrix) |
+----------------------------------+-------+-----------------------------+
| 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16631 |
+----------------------------------+-------+-----------------------------+

We saved the blobs from both databases locally by patching the Perl API.
The resulting files are zlib-compressed and start with the magic string
"VEP" after uncompressing them. It turns out that the original Ensembl
blob can be uncompressed with the zcat command, whereas zcat reports
errors for the blob from our local database (a CRC checksum error and a
file size error).
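For anyone who wants to reproduce this check without patching the API, a
minimal Python sketch of the zcat-style test follows. The blob here is a
toy stand-in, and the corruption is simulated by flipping the CRC32 bytes
in the gzip trailer; real blobs would come from the prediction_matrix
column:

```python
import gzip
import zlib

def gunzip_report(blob: bytes) -> str:
    """Mimic zcat: try to decompress a gzip blob and report the outcome."""
    try:
        data = gzip.decompress(blob)  # also verifies the CRC32/size trailer
        return "ok: %d bytes, starts with %r" % (len(data), data[:3])
    except (OSError, EOFError, zlib.error) as exc:
        return "failed: %s" % exc

# Toy stand-in for a prediction-matrix blob (real ones start with "VEP").
good = gzip.compress(b"VEP" + b"\x00" * 100)
print(gunzip_report(good))

# Flipping the 4 CRC32 bytes in the gzip trailer provokes the same class
# of error that zcat reports for the corrupted blob.
bad = good[:-8] + bytes(b ^ 0xFF for b in good[-8:-4]) + good[-4:]
print(gunzip_report(bad))
```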

We then did a binary diff of the hexdumps of both compressed blobs stored
as files (xxd and vimdiff). The *only* difference between our local blob
and the Ensembl blob is that ours is missing *all* 0x0d characters in the
blob. [See the attached figure vimdiff-ing the hexdumps of the blobs as
they are stored in the database: left column, from the public Ensembl
database; right column, from our local installation.] This also explains
the smaller blob size in our installation.
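This hypothesis can also be checked mechanically rather than by eye: if
the corrupted blob really is the intact blob minus its carriage returns,
then stripping every 0x0d byte from the intact blob must reproduce it
byte for byte, and the 0x0d count must equal the size difference (16698 -
16631 = 67 in the example above). A small Python sketch, with toy byte
strings standing in for the real blob files:

```python
def is_cr_stripped_copy(intact: bytes, corrupted: bytes) -> bool:
    """True if `corrupted` equals `intact` with every 0x0d byte removed."""
    return intact.replace(b"\x0d", b"") == corrupted

# Toy blobs: the "intact" one contains two carriage returns.
intact = b"\x01\x0d\x02\x03\x0d\x04"
corrupted = b"\x01\x02\x03\x04"

print(is_cr_stripped_copy(intact, corrupted))                 # True
print(intact.count(b"\x0d") == len(intact) - len(corrupted))  # True
```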

The data were imported into the database exactly as described here:
https://www.ensembl.org/info/docs/webcode/mirror/install/ensembl-data.html
and the checksums were verified successfully with the "sum" utility.

We discovered that up to release 98, queries to our internal installation
and to the public ensembldb.ensembl.org give identical results. Starting
with release 99, the results differ and we get the errors described
above.

We are using MariaDB 10.3.23 on Debian Bullseye (Testing), but we also
tried a fresh MySQL installation from
https://dev.mysql.com/downloads/mysql/. We also tried "LOAD DATA INFILE"
instead of mysqlimport, always with the same results.
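One way to localize where the bytes disappear would be to count the 0x0d
bytes in the uncompressed .txt dump file before import and compare that
with the count inside the loaded blob: if the dump still contains them,
the stripping happens at import time. A sketch of that check (the real
dump file name is omitted here; a throwaway file stands in for the demo):

```python
import os
import tempfile

def count_cr_bytes(path: str) -> int:
    """Count 0x0d bytes in a file, e.g. a table dump prior to import."""
    with open(path, "rb") as fh:
        return fh.read().count(b"\x0d")

# Demo on a throwaway file containing three carriage returns.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"row1\x0d\x0arow2\x0d\x0arow3\x0d")
    demo_path = fh.name

print(count_cr_bytes(demo_path))  # 3
os.unlink(demo_path)
```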

Has anybody had similar experiences? Has the development team changed
anything in the format of the MySQL database dumps that affects blob
encoding? Might this point to a platform-specific issue, given that only
the carriage return character '\r' (0x0d) is affected?

Any help is highly appreciated. Thanks for sharing your thoughts on this.

Chris and Daniele - EURAC research, Institute for Biomedicine
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 16698-vimdiff.png
Type: image/png
Size: 349108 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20200916/7059fdc9/attachment.png>
