[ensembl-dev] Corrupted predication matrix in local installation from Ensembl version 99 on

Tue Sep 29 13:12:52 BST 2020

Dear James,

Thanks for taking care of this. I can confirm that with the updated tables
our local installation now reproduces the results that we get from the
public Ensembl server (release 100).

I wonder whether anybody else still uses a local installation as we do
here. I suppose most of you out there are fine with running VEP locally
instead.

And sorry for the typo in the subject line. I suppose I had the 'zcat'
command in mind when writing that mail: prediCATion matrix...

All the best,

Chris

On Fri, Sep 25, 2020 at 05:21:06PM +0100, James Allen wrote:
>Hello,
>The MySQL variation dump files have now been corrected for all
>species, across all the Ensembl divisions (i.e.
>ftp://ftp.ensembl.org/pub and ftp://ftp.ensemblgenomes.org/pub), for
>Ensembl releases 99, 100, 101 and Ensembl Genomes releases 46, 47, 48.
>
>This fix corrects the compressed data in the dump files of four tables:
>  * compressed_genotype_region
>  * compressed_genotype_var
>  * protein_function_predictions
>  * protein_function_predictions_attrib
>Dump files for other tables are unchanged.
>
>The database prefixes for the affected species are listed below.
>
>Regards,
>James Allen
>Ensembl Production team
>
>
>Vertebrates
>  * bos_taurus
>  * canis_lupus_familiaris
>  * capra_hircus
>  * danio_rerio
>  * equus_caballus
>  * felis_catus
>  * gallus_gallus
>  * macaca_mulatta
>  * mus_musculus
>  * nomascus_leucogenys
>  * ornithorhynchus_anatinus
>  * ovis_aries
>  * pan_troglodytes
>  * pongo_abelii
>  * rattus_norvegicus
>  * saccharomyces_cerevisiae
>  * sus_scrofa
>  * taeniopygia_guttata
>
>Fungi
>  * fusarium_oxysporum
>  * puccinia_graminis
>  * puccinia_graminisug99
>  * saccharomyces_cerevisiae
>  * schizosaccharomyces_pombe
>  * verticillium_dahliaejr2
>  * zymoseptoria_tritici
>
>Protists
>  * phaeodactylum_tricornutum
>  * phytophthora_infestans
>
>Metazoa
>  * aedes_aegypti_lvpagwg
>  * anopheles_arabiensis
>  * anopheles_culicifacies
>  * anopheles_epiroticus
>  * anopheles_farauti
>  * anopheles_funestus
>  * anopheles_gambiae
>  * anopheles_melas
>  * anopheles_merus
>  * anopheles_minimus
>  * anopheles_quadriannulatus
>  * anopheles_sinensis
>  * anopheles_stephensi_indian
>  * anopheles_stephensi
>  * biomphalaria_glabrata
>  * culex_quinquefasciatus
>  * ixodes_scapularis
>  * lutzomyia_longipalpis
>  * phlebotomus_papatasi
>
>Plants
>  * arabidopsis_thaliana
>  * brachypodium_distachyon
>  * helianthus_annuus
>  * hordeum_vulgare
>  * oryza_glaberrima
>  * oryza_glumipatula
>  * oryza_indica
>  * oryza_sativa
>  * solanum_lycopersicum
>  * sorghum_bicolor
>  * triticum_aestivum
>  * triticum_turgidum
>  * vitis_vinifera
>  * zea_mays
>
>
>On 21/09/2020 17:18, James Allen wrote:
>>Hello,
>>Thanks for raising this issue, and for the detailed diagnostics.
>>There was an error in the dumping process that created the
>>protein_function_predictions table (and, incidentally, the
>>protein_function_predictions_attrib table), whereby '\r' characters
>>were erroneously removed.
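
The effect described above can be illustrated with a short Python sketch. It does not reproduce the actual dumping code, which is not shown in this thread; it only assumes that every 0x0d byte is deleted from an already compressed blob. The helper names strip_cr and survives_cr_stripping are made up for this illustration.

```python
import gzip
import zlib


def strip_cr(blob: bytes) -> bytes:
    """Delete every carriage-return byte (0x0d) from a byte string."""
    return blob.replace(b"\r", b"")


def survives_cr_stripping(payload: bytes) -> bool:
    """Compress payload, strip 0x0d from the compressed bytes, then try to
    decompress again and compare with the original payload."""
    # mtime=0 keeps the gzip header free of time-dependent bytes
    damaged = strip_cr(gzip.compress(payload, mtime=0))
    try:
        return gzip.decompress(damaged) == payload
    except (OSError, EOFError, zlib.error):
        # invalid deflate data, truncated stream, or CRC/size mismatch
        return False
```

Deleting '\r' from plain text only changes line endings, but in compressed data 0x0d is an ordinary payload byte, so removing it shifts everything that follows. A payload whose compressed form happens to contain no 0x0d byte is unaffected, which is why only some rows need be hit.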
>>
>>We have recreated the dump files for the human database, for
>>releases 99, 100, and 101 (links below). Similar problems exist for
>>some other species, which will be fixed later this week (we'll
>>confirm on this list when this is done, and for which species).
>>
>>Apologies for the inconvenience. Hopefully this resolves your
>>problems, but please do get in touch if not.
>>
>>Cheers,
>>James
>>Ensembl Production team
>>
>>
>>ftp://ftp.ensembl.org/pub/release-99/mysql/homo_sapiens_variation_99_38/
>>ftp://ftp.ensembl.org/pub/release-100/mysql/homo_sapiens_variation_100_38/
>>ftp://ftp.ensembl.org/pub/release-101/mysql/homo_sapiens_variation_101_38/
>>
>>
>>On 16/09/2020 16:20, Chris Weichenberger wrote:
>>>Dear all,
>>>
>>>Since Ensembl release 99 we have been facing errors when using SIFT and
>>>PolyPhen scores from the variation database on a local installation of the
>>>human Ensembl database (GRCh38). Our program uses the Ensembl Perl API, and
>>>when it calculates SIFT scores we get the following error message (this
>>>example uses Ensembl API version 100):
>>>
>>>-------------------- EXCEPTION --------------------
>>>MSG: Failed to gunzip:
>>>STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::expand_matrix
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:654
>>>
>>>STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::prediction_from_matrix
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:699
>>>
>>>STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::get_prediction
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:349
>>>
>>>STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::_protein_function_prediction
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1227
>>>
>>>STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::sift_score
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1021
>>>
>>>[removed: the rest of the stack originating from our software]
>>>
>>>When we are connecting to the public Ensembl database via the Perl API, the
>>>program works as expected and computes the scores. Debugging this problem
>>>revealed that the binary data for the prediction matrix stored in our
>>>database is different to what is stored in the official Ensembl database.
>>>
>>>$ mysql -h ensembldb.ensembl.org -u anonymous
>>>
>>>mysql> use homo_sapiens_variation_100_38
>>>mysql> SELECT  t.translation_md5, a.value, length(p.prediction_matrix)
>>>FROM  (protein_function_predictions p, translation_md5 t, attrib a)
>>>WHERE t.translation_md5 = '6cfea960ade30a16e3f138d55d1eaf03' AND
>>>      a.value = 'sift'  AND
>>>      p.translation_md5_id = t.translation_md5_id AND
>>>      p.analysis_attrib_id = a.attrib_id;
>>>
>>>+----------------------------------+-------+-----------------------------+
>>>| translation_md5                  | value | length(p.prediction_matrix) |
>>>+----------------------------------+-------+-----------------------------+
>>>| 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16698 |
>>>+----------------------------------+-------+-----------------------------+
>>>
>>>On our system, however, the size of the prediction matrix is smaller:
>>>
>>>+----------------------------------+-------+-----------------------------+
>>>| translation_md5                  | value | length(p.prediction_matrix) |
>>>+----------------------------------+-------+-----------------------------+
>>>| 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16631 |
>>>+----------------------------------+-------+-----------------------------+
>>>
>>>We have locally saved the blobs from both databases by patching the Perl
>>>API. These are files that are zlib-compressed and start with the magic
>>>number "VEP" after uncompressing them. It turns out that the original
>>>Ensembl blob can be uncompressed with the zcat command, whereas there are
>>>error messages when trying to zcat our blob from the local database (CRC
>>>checksum error and file size error reported by zcat).
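
The zcat test described above can also be scripted. The following is a minimal Python sketch, assuming the blob is available as bytes (for example read from a saved file) and that it is gzip-framed, as the zcat and gunzip behaviour suggests. The function name check_matrix_blob and its return values are made up for illustration; the "VEP" header is the one reported above.

```python
import gzip
import zlib

MAGIC = b"VEP"  # start of the uncompressed data, as reported in this mail


def check_matrix_blob(blob: bytes) -> str:
    """Classify a blob as 'ok', 'corrupt' or 'unexpected header'."""
    try:
        data = gzip.decompress(blob)
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile (CRC or size mismatch) is a subclass of OSError;
        # EOFError signals a truncated stream, zlib.error invalid data
        return "corrupt"
    return "ok" if data.startswith(MAGIC) else "unexpected header"
```

Running this over all saved blobs would show how many rows are affected, rather than testing a single translation.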
>>>
>>>We then did a binary diff of the hexdump of both compressed blobs stored as
>>>files (xxd and vimdiff).  The *only* difference between our local blob and
>>>the Ensembl blob is that ours is missing *all* 0x0d characters in the blob.
>>>[see attached figure vimdiff-ing the hexdumps of the blobs as they are
>>>stored in the database: left column, from public Ensembl database; right
>>>column, from our local installation.] This also explains the smaller blob
>>>size in our installation.
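
The hexdump comparison can be turned into an automatic check. The sketch below, in Python, assumes both blobs are available as bytes; the function name cr_bytes_lost is made up for illustration. It returns the number of 0x0d bytes that were removed if the suspect blob is exactly the original with every 0x0d deleted, and None otherwise.

```python
from typing import Optional

CR = b"\r"  # 0x0d


def cr_bytes_lost(original: bytes, suspect: bytes) -> Optional[int]:
    """Return how many 0x0d bytes are missing from suspect if that is the
    only difference to original; return None for any other difference."""
    if original.replace(CR, b"") != suspect:
        return None
    return original.count(CR)
```

For the example above, the lengths 16698 and 16631 differ by 67, so if the missing 0x0d bytes are indeed the only difference, the public blob contains 67 of them and the function would return 67.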
>>>
>>>The data were imported in the database exactly as described here:
>>>https://www.ensembl.org/info/docs/webcode/mirror/install/ensembl-data.html
>>>and the checksums were correctly verified with the "sum" utility.
>>>
>>>We discovered that up to release 98 queries to our internal installation
>>>and to the public ensembldb.ensembl.org give identical results. Starting
>>>with release 99, the results differ and we are getting the errors described
>>>above.
>>>
>>>We are using MariaDB release 10.3.23 on Debian Bullseye (Testing), but we
>>>also tried with a fresh MySQL installation from
>>>https://dev.mysql.com/downloads/mysql/. We also tried using "LOAD DATA
>>>INFILE" instead of mysqlimport, always with the same results.
>>>
>>>Did anybody have similar experiences? Has the development team changed
>>>anything in the format of the MySQL database dumps that affects blob
>>>encoding? Might this be a pointer to a platform-specific issue, as only the
>>>carriage return character '\r' (0x0d) is affected?
>>>
>>>Any help is highly appreciated. Thanks for sharing your thoughts on this.
>>>
>>>Chris and Daniele - EURAC research, Institute for Biomedicine
>>>
>>>_______________________________________________
>>>Dev mailing list    Dev at ensembl.org
>>>Posting guidelines and subscribe/unsubscribe info:
>>>https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>>>Ensembl Blog: http://www.ensembl.info/
>>>