[ensembl-dev] Corrupted prediction matrix in local installation from Ensembl version 99 on

Fri Sep 25 17:21:06 BST 2020

Hello,
The MySQL variation dump files have now been corrected for all species, across 
all the Ensembl divisions (i.e. ftp://ftp.ensembl.org/pub and 
ftp://ftp.ensemblgenomes.org/pub), for Ensembl releases 99, 100, 101 and Ensembl 
Genomes releases 46, 47, 48.

This fix corrects the compressed data in the dump files of four tables:
   * compressed_genotype_region
   * compressed_genotype_var
   * protein_function_predictions
   * protein_function_predictions_attrib
Dump files for other tables are unchanged.

The database prefixes for the affected species are listed below.

Regards,
James Allen
Ensembl Production team

Vertebrates
   * bos_taurus
   * canis_lupus_familiaris
   * capra_hircus
   * danio_rerio
   * equus_caballus
   * felis_catus
   * gallus_gallus
   * macaca_mulatta
   * mus_musculus
   * nomascus_leucogenys
   * ornithorhynchus_anatinus
   * ovis_aries
   * pan_troglodytes
   * pongo_abelii
   * rattus_norvegicus
   * saccharomyces_cerevisiae
   * sus_scrofa
   * taeniopygia_guttata

Fungi
   * fusarium_oxysporum
   * puccinia_graminis
   * puccinia_graminisug99
   * saccharomyces_cerevisiae
   * schizosaccharomyces_pombe
   * verticillium_dahliaejr2
   * zymoseptoria_tritici

Protists
   * phaeodactylum_tricornutum
   * phytophthora_infestans

Metazoa
   * aedes_aegypti_lvpagwg
   * anopheles_arabiensis
   * anopheles_culicifacies
   * anopheles_epiroticus
   * anopheles_farauti
   * anopheles_funestus
   * anopheles_gambiae
   * anopheles_melas
   * anopheles_merus
   * anopheles_minimus
   * anopheles_quadriannulatus
   * anopheles_sinensis
   * anopheles_stephensi_indian
   * anopheles_stephensi
   * biomphalaria_glabrata
   * culex_quinquefasciatus
   * ixodes_scapularis
   * lutzomyia_longipalpis
   * phlebotomus_papatasi

Plants
   * arabidopsis_thaliana
   * brachypodium_distachyon
   * helianthus_annuus
   * hordeum_vulgare
   * oryza_glaberrima
   * oryza_glumipatula
   * oryza_indica
   * oryza_sativa
   * solanum_lycopersicum
   * sorghum_bicolor
   * triticum_aestivum
   * triticum_turgidum
   * vitis_vinifera
   * zea_mays

On 21/09/2020 17:18, James Allen wrote:
> Hello,
> Thanks for raising this issue, and for the detailed diagnostics. There was an 
> error in the dumping process that created the protein_function_predictions table 
> (and, incidentally, the protein_function_predictions_attrib table), whereby '\r' 
> characters were erroneously removed.
> 
> We have recreated the dump files for the human database, for releases 99, 100, 
> and 101 (links below). Similar problems exist for some other species, which will 
> be fixed later this week (we'll confirm on this list when this is done, and for 
> which species).
> 
> Apologies for the inconvenience. Hopefully this resolves your problems, but 
> please do get in touch if not.
> 
> Cheers,
> James
> Ensembl Production team
> 
> 
> ftp://ftp.ensembl.org/pub/release-99/mysql/homo_sapiens_variation_99_38/
> ftp://ftp.ensembl.org/pub/release-100/mysql/homo_sapiens_variation_100_38/
> ftp://ftp.ensembl.org/pub/release-101/mysql/homo_sapiens_variation_101_38/
> 
> 
> On 16/09/2020 16:20, Chris Weichenberger wrote:
>> Dear all,
>>
>> Since Ensembl release 99 we have been facing errors when using SIFT and
>> PolyPhen scores from the variation database on a local installation of the
>> human Ensembl database (GRCh38). Our program uses the Ensembl Perl API, and
>> when it computes SIFT scores we get the following error message (this
>> example uses Ensembl API version 100):
>>
>> -------------------- EXCEPTION --------------------
>> MSG: Failed to gunzip:
>> STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::expand_matrix /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:654
>> STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::prediction_from_matrix /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:699
>> STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::get_prediction /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:349
>> STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::_protein_function_prediction /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1227
>> STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::sift_score /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1021
>>
>> [removed: the rest of the stack originating from our software]
>>
>> When we connect to the public Ensembl database via the Perl API, the
>> program works as expected and computes the scores. Debugging this problem
>> revealed that the binary data for the prediction matrix stored in our
>> database differs from what is stored in the official Ensembl database.
>>
>> $ mysql -h ensembldb.ensembl.org -u anonymous
>>
>> mysql> use homo_sapiens_variation_100_38
>> mysql> SELECT  t.translation_md5, a.value, length(p.prediction_matrix)
>> FROM  (protein_function_predictions p, translation_md5 t, attrib a)
>> WHERE t.translation_md5 = '6cfea960ade30a16e3f138d55d1eaf03' AND
>>       a.value = 'sift'  AND
>>       p.translation_md5_id = t.translation_md5_id AND
>>       p.analysis_attrib_id = a.attrib_id;
>>
>> +----------------------------------+-------+-----------------------------+
>> | translation_md5                  | value | length(p.prediction_matrix) |
>> +----------------------------------+-------+-----------------------------+
>> | 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16698 |
>> +----------------------------------+-------+-----------------------------+
>>
>> On our system, however, the size of the prediction matrix is smaller:
>>
>> +----------------------------------+-------+-----------------------------+
>> | translation_md5                  | value | length(p.prediction_matrix) |
>> +----------------------------------+-------+-----------------------------+
>> | 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16631 |
>> +----------------------------------+-------+-----------------------------+
>>
>> We have locally saved the blobs from both databases by patching the Perl
>> API. These are files that are zlib-compressed and start with the magic
>> number "VEP" after uncompressing them. It turns out that the original
>> Ensembl blob can be uncompressed with the zcat command, whereas there are
>> error messages when trying to zcat our blob from the local database (CRC
>> checksum error and file size error reported by zcat).
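
The zcat check described above can also be sketched in Python. This is a minimal sketch, not part of the Ensembl API: it assumes the blob is a complete gzip stream (consistent with zcat reading the original blob) and that the uncompressed payload starts with "VEP", as reported above. The function name blob_is_intact and its magic parameter are hypothetical.

```python
import gzip
import zlib


def blob_is_intact(blob: bytes, magic: bytes = b"VEP") -> bool:
    """Return True if blob gunzips cleanly and the payload starts with magic.

    Assumption: the blob is a gzip stream whose uncompressed content
    begins with the given magic bytes ("VEP" for the prediction matrix).
    """
    try:
        payload = gzip.decompress(blob)
    except (OSError, EOFError, zlib.error):
        # Bad header, corrupt deflate data, CRC/length mismatch, or truncation.
        return False
    return payload.startswith(magic)
```

Given the bytes of a prediction_matrix value saved to a file, blob_is_intact(open(path, "rb").read()) should return True for a blob from the public database and False for one that fails with the zcat errors above.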
>>
>> We then did a binary diff of the hexdump of both compressed blobs stored as
>> files (xxd and vimdiff).  The *only* difference between our local blob and
>> the Ensembl blob is that ours is missing *all* 0x0d characters in the blob.
>> [see attached figure vimdiff-ing the hexdumps of the blobs as they are
>> stored in the database: left column, from public Ensembl database; right
>> column, from our local installation.] This also explains the smaller blob
>> size in our installation.
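
The comparison above can be expressed as a small Python check. This is a sketch under stated assumptions: reference and local stand for the raw blob bytes saved from the public and the local database, and the helper name differs_only_by_missing_cr is made up for illustration.

```python
def differs_only_by_missing_cr(reference: bytes, local: bytes) -> bool:
    """True if local is exactly reference with every 0x0d byte removed.

    Returns False when the two blobs are identical or differ in any
    other way.
    """
    return reference != local and reference.replace(b"\r", b"") == local


def missing_byte_count(reference: bytes, local: bytes) -> int:
    """Number of bytes by which local is shorter than reference."""
    return len(reference) - len(local)
```

If the first function returns True, missing_byte_count(reference, local) equals reference.count(b"\r"). For the blob shown earlier, the reported lengths differ by 16698 - 16631 = 67 bytes, which would then be the number of 0x0d bytes in the reference blob.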
>>
>> The data were imported in the database exactly as described here:
>> https://www.ensembl.org/info/docs/webcode/mirror/install/ensembl-data.html
>> and the checksums were correctly verified with the "sum" utility.
>>
>> We discovered that, up to release 98, queries to our internal installation
>> and to the public ensembldb.ensembl.org give identical results. Starting
>> with release 99, the results differ and we get the errors described
>> above.
>>
>> We are using MariaDB release 10.3.23 on Debian Bullseye (Testing), but we
>> also tried a fresh MySQL installation from
>> https://dev.mysql.com/downloads/mysql/. We also tried using "LOAD DATA
>> INFILE" instead of mysqlimport, always with the same results.
>>
>> Did anybody have similar experiences? Has the development team changed
>> anything in the format of the MySQL database dumps that affects blob
>> encoding? Might this be a pointer to a platform-specific issue, as only the
>> carriage return character '\r' (0x0d) is affected?
>>
>> Any help is highly appreciated. Thanks for sharing your thoughts on this.
>>
>> Chris and Daniele - EURAC research, Institute for Biomedicine
>>