[ensembl-dev] Corrupted prediction matrix in local installation from Ensembl version 99 on

Mon Sep 21 17:18:32 BST 2020

Hello,
Thanks for raising this issue, and for the detailed diagnostics. There was an
error in the process that dumped the protein_function_predictions table (and,
incidentally, the protein_function_predictions_attrib table), whereby '\r'
(0x0d) characters were erroneously removed from the dump files.
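
To illustrate why missing '\r' bytes lead to the "Failed to gunzip" error,
here is a minimal Python sketch. It is not the dumping code: the payload and
the helper names strip_cr and can_gunzip are made up for this example, and the
blob is written with compresslevel=0 only so that the payload's 0x0d bytes are
guaranteed to appear in the gzip stream.

```python
import gzip
import zlib


def strip_cr(blob: bytes) -> bytes:
    """Mimic the faulty dump: drop every 0x0d byte from a binary blob."""
    return blob.replace(b"\r", b"")


def can_gunzip(blob: bytes) -> bool:
    """Return True if the blob is a complete, valid gzip stream."""
    try:
        gzip.decompress(blob)
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Hypothetical payload; "VEP" stands in for the magic number mentioned in the thread.
payload = b"VEP" + b"score\r\n" * 100
# compresslevel=0 stores the payload verbatim, so its 0x0d bytes reach the stream.
blob = gzip.compress(payload, compresslevel=0, mtime=0)
damaged = strip_cr(blob)

print("intact blob decompresses: ", can_gunzip(blob))
print("damaged blob decompresses:", can_gunzip(damaged))
print("bytes lost:", len(blob) - len(damaged))
```

The intact blob decompresses, while the stripped copy is shorter and no longer
forms a valid gzip stream, which matches the symptoms reported below.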

We have recreated the dump files for the human database, for releases 99, 100, 
and 101 (links below). Similar problems exist for some other species, which will 
be fixed later this week (we'll confirm to this list when this is done, and for
which species).

Apologies for the inconvenience. Hopefully this resolves your problems, but
please do get in touch if not.

Cheers,
James
Ensembl Production team

ftp://ftp.ensembl.org/pub/release-99/mysql/homo_sapiens_variation_99_38/
ftp://ftp.ensembl.org/pub/release-100/mysql/homo_sapiens_variation_100_38/
ftp://ftp.ensembl.org/pub/release-101/mysql/homo_sapiens_variation_101_38/

On 16/09/2020 16:20, Chris Weichenberger wrote:
> Dear all,
> 
> Since Ensembl release 99 we have been facing errors when using SIFT and
> PolyPhen scores from the variation database on a local installation of the
> human Ensembl database (GRCh38). Our program uses the Ensembl Perl API, and
> when it calculates SIFT scores we get the following error message (this
> example uses Ensembl API version 100):
> 
> -------------------- EXCEPTION --------------------
> MSG: Failed to gunzip:
> STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::expand_matrix
> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:654 
> 
> STACK 
> Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::prediction_from_matrix
> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:699 
> 
> STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::get_prediction
> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:349 
> 
> STACK 
> Bio::EnsEMBL::Variation::TranscriptVariationAllele::_protein_function_prediction
> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1227 
> 
> STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::sift_score
> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1021 
> 
> [removed: the rest of the stack originating from our software]
> 
> When we connect to the public Ensembl database via the Perl API, the
> program works as expected and computes the scores. Debugging this problem
> revealed that the binary data for the prediction matrix stored in our
> database is different to what is stored in the official Ensembl database.
> 
> $ mysql -h ensembldb.ensembl.org -u anonymous
> 
> mysql> use homo_sapiens_variation_100_38
> mysql> SELECT  t.translation_md5, a.value, length(p.prediction_matrix)
> FROM  (protein_function_predictions p, translation_md5 t, attrib a)
> WHERE t.translation_md5 = '6cfea960ade30a16e3f138d55d1eaf03' AND
>       a.value = 'sift'  AND
>       p.translation_md5_id = t.translation_md5_id AND
>       p.analysis_attrib_id = a.attrib_id;
> 
> +----------------------------------+-------+-----------------------------+
> | translation_md5                  | value | length(p.prediction_matrix) |
> +----------------------------------+-------+-----------------------------+
> | 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16698 |
> +----------------------------------+-------+-----------------------------+
> 
> On our system, however, the size of the prediction matrix is smaller:
> 
> +----------------------------------+-------+-----------------------------+
> | translation_md5                  | value | length(p.prediction_matrix) |
> +----------------------------------+-------+-----------------------------+
> | 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16631 |
> +----------------------------------+-------+-----------------------------+
> 
> We have locally saved the blobs from both databases by patching the Perl
> API. These are files that are zlib-compressed and start with the magic
> number "VEP" after uncompressing them. It turns out that the original
> Ensembl blob can be uncompressed with the zcat command, whereas there are
> error messages when trying to zcat our blob from the local database (CRC
> checksum error and file size error reported by zcat).
> 
> We then did a binary diff of the hexdump of both compressed blobs stored as
> files (xxd and vimdiff).  The *only* difference between our local blob and
> the Ensembl blob is that ours is missing *all* 0x0d characters in the blob.
> [see attached figure vimdiff-ing the hexdumps of the blobs as they are
> stored in the database: left column, from public Ensembl database; right
> column, from our local installation.] This also explains the smaller blob
> size in our installation.
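
The comparison described above can also be scripted. The following is a
minimal Python sketch, assuming both blobs have been saved to files; the
function name missing_byte_report and the sample bytes are hypothetical and
only serve the illustration.

```python
def missing_byte_report(reference: bytes, local: bytes, byte: int = 0x0D) -> dict:
    """Check whether `local` is exactly `reference` with every `byte` removed."""
    needle = bytes([byte])
    return {
        # True only if deleting all occurrences of the byte reproduces `local`.
        "only_difference_is_missing_byte": reference.replace(needle, b"") == local,
        "occurrences_in_reference": reference.count(needle),
        "length_difference": len(reference) - len(local),
    }


# Made-up sample data standing in for the two saved blobs.
reference_blob = bytes([0x1F, 0x8B, 0x08, 0x0D, 0x41, 0x0D, 0x42])
local_blob = bytes([0x1F, 0x8B, 0x08, 0x41, 0x42])

report = missing_byte_report(reference_blob, local_blob)
print(report)
```

With real data the two arguments would come from the saved files, for example
via pathlib.Path(...).read_bytes(). If the diagnosis holds for the matrix
quoted above, the length difference would be 16698 - 16631 = 67, equal to the
number of 0x0d bytes in the blob from the public database.
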
> 
> The data were imported in the database exactly as described here:
> https://www.ensembl.org/info/docs/webcode/mirror/install/ensembl-data.html
> and the checksums were correctly verified with the "sum" utility.
> 
> We discovered that, up to release 98, queries to our internal installation
> and to the public ensembldb.ensembl.org give identical results. Starting
> with release 99, the results differ and we are getting the errors described
> above.
> 
> We are using MariaDB release 10.3.23 on Debian Bullseye (Testing), but we
> also tried with a fresh MySQL installation from
> https://dev.mysql.com/downloads/mysql/. We also tried using "LOAD DATA
> INFILE" instead of mysqlimport, always with the same results.
> 
> Has anybody had similar experiences? Has the development team changed
> anything in the format of the MySQL database dumps that affects blob
> encoding? Might this be a pointer to a platform-specific issue, as only the
> carriage return character '\r' (0x0d) is affected?
> 
> Any help is highly appreciated. Thanks for sharing your thoughts on this.
> 
> Chris and Daniele - EURAC research, Institute for Biomedicine
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/
>