[ensembl-dev] Corrupted predication matrix in local installation from Ensembl version 99 on
Chris Weichenberger
christian.weichenberger at eurac.edu
Tue Sep 29 13:12:52 BST 2020
Dear James,
thanks for taking care of this. I can confirm that with the updated tables
our local installation now reproduces the results that we receive when
using the public Ensembl server (release 100).
I wonder whether anybody else still runs a local installation the way we do
here. I suppose most of you out there are fine with running VEP locally
instead.
And sorry for the typo in the subject line, I suppose I have had the 'zcat'
command in mind when writing that mail - prediCATion matrix...
All the best,
Chris
On Fri, Sep 25, 2020 at 05:21:06PM +0100, James Allen wrote:
>Hello,
>The mysql variation dump files have now been corrected for all
>species, across all the Ensembl divisions (i.e.
>ftp://ftp.ensembl.org/pub and ftp://ftp.ensemblgenomes.org/pub), for
>Ensembl releases 99, 100, 101 and Ensembl Genomes releases 46, 47, 48.
>
>This fix corrects the compressed data in the dump files of four tables:
> * compressed_genotype_region
> * compressed_genotype_var
> * protein_function_predictions
> * protein_function_predictions_attrib
>Dump files for other tables are unchanged.
>
>The database prefixes for the affected species are listed below.
>
>Regards,
>James Allen
>Ensembl Production team
>
>
>Vertebrates
> * bos_taurus
> * canis_lupus_familiaris
> * capra_hircus
> * danio_rerio
> * equus_caballus
> * felis_catus
> * gallus_gallus
> * macaca_mulatta
> * mus_musculus
> * nomascus_leucogenys
> * ornithorhynchus_anatinus
> * ovis_aries
> * pan_troglodytes
> * pongo_abelii
> * rattus_norvegicus
> * saccharomyces_cerevisiae
> * sus_scrofa
> * taeniopygia_guttata
>
>Fungi
> * fusarium_oxysporum
> * puccinia_graminis
> * puccinia_graminisug99
> * saccharomyces_cerevisiae
> * schizosaccharomyces_pombe
> * verticillium_dahliaejr2
> * zymoseptoria_tritici
>
>Protists
> * phaeodactylum_tricornutum
> * phytophthora_infestans
>
>Metazoa
> * aedes_aegypti_lvpagwg
> * anopheles_arabiensis
> * anopheles_culicifacies
> * anopheles_epiroticus
> * anopheles_farauti
> * anopheles_funestus
> * anopheles_gambiae
> * anopheles_melas
> * anopheles_merus
> * anopheles_minimus
> * anopheles_quadriannulatus
> * anopheles_sinensis
> * anopheles_stephensi_indian
> * anopheles_stephensi
> * biomphalaria_glabrata
> * culex_quinquefasciatus
> * ixodes_scapularis
> * lutzomyia_longipalpis
> * phlebotomus_papatasi
>
>Plants
> * arabidopsis_thaliana
> * brachypodium_distachyon
> * helianthus_annuus
> * hordeum_vulgare
> * oryza_glaberrima
> * oryza_glumipatula
> * oryza_indica
> * oryza_sativa
> * solanum_lycopersicum
> * sorghum_bicolor
> * triticum_aestivum
> * triticum_turgidum
> * vitis_vinifera
> * zea_mays
>
>
>On 21/09/2020 17:18, James Allen wrote:
>>Hello,
>>Thanks for raising this issue, and for the detailed diagnostics.
>>There was an error in the dumping process that created the
>>protein_function_predictions table (and, incidentally, the
>>protein_function_predictions_attrib table), whereby '\r' characters
>>were erroneously removed.
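To illustrate why dropping '\r' bytes is fatal for compressed data, here is a minimal sketch. It does not use the real dump files: the payload is random stand-in data, and the helper names (strip_cr, can_gunzip) are made up for this example. It only mimics the effect described above, i.e. every 0x0d byte vanishing from an already compressed stream.

```python
import gzip
import random
import zlib


def strip_cr(blob: bytes) -> bytes:
    """Remove every 0x0d byte, mimicking the faulty dump step (hypothetical helper)."""
    return blob.replace(b"\r", b"")


def can_gunzip(blob: bytes) -> bool:
    """Return True if the blob decompresses cleanly as gzip data."""
    try:
        gzip.decompress(blob)
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Random stand-in payload (not real Ensembl data); a fixed seed keeps it repeatable.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(4096))

intact = gzip.compress(payload, mtime=0)
damaged = strip_cr(intact)

print("bytes lost:", len(intact) - len(damaged))
print("intact decompresses:", can_gunzip(intact))
print("damaged decompresses:", can_gunzip(damaged))
```

Because a compressed stream is effectively arbitrary binary data, 0x0d bytes occur in it by chance, and removing even one shifts everything after it.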
>>
>>We have recreated the dump files for the human database, for
>>releases 99, 100, and 101 (links below). Similar problems exist for
>>some other species, which will be fixed later this week (we'll
>>confirm on this list when this is done, and for which species).
>>
>>Apologies for the inconvenience, hopefully this resolves your
>>problems, but please do get in touch if not.
>>
>>Cheers,
>>James
>>Ensembl Production team
>>
>>
>>ftp://ftp.ensembl.org/pub/release-99/mysql/homo_sapiens_variation_99_38/
>>ftp://ftp.ensembl.org/pub/release-100/mysql/homo_sapiens_variation_100_38/
>>ftp://ftp.ensembl.org/pub/release-101/mysql/homo_sapiens_variation_101_38/
>>
>>
>>On 16/09/2020 16:20, Chris Weichenberger wrote:
>>>Dear all,
>>>
>>>since Ensembl release 99 we are facing errors when using SIFT and PolyPhen
>>>scores from the variation database on a local installation of the human
>>>Ensembl database (GRCh38). Our program uses the Ensembl Perl API and upon
>>>SIFT score calculations we end up with an error message as follows (example
>>>is using Ensembl API version 100):
>>>
>>>-------------------- EXCEPTION --------------------
>>>MSG: Failed to gunzip:
>>>STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::expand_matrix
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:654
>>>
>>>STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::prediction_from_matrix
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:699
>>>
>>>STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::get_prediction
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:349
>>>
>>>STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::_protein_function_prediction
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1227
>>>
>>>STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::sift_score
>>>/usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1021
>>>
>>>[removed: the rest of the stack originating from our software]
>>>
>>>When we are connecting to the public Ensembl database via the Perl API, the
>>>program works as expected and computes the scores. Debugging this problem
>>>revealed that the binary data for the prediction matrix stored in our
>>>database is different to what is stored in the official Ensembl database.
>>>
>>>$ mysql -h ensembldb.ensembl.org -u anonymous
>>>
>>>mysql> use homo_sapiens_variation_100_38
>>>mysql> SELECT t.translation_md5, a.value, length(p.prediction_matrix)
>>>FROM (protein_function_predictions p, translation_md5 t, attrib a)
>>>WHERE t.translation_md5 = '6cfea960ade30a16e3f138d55d1eaf03' AND
>>> a.value = 'sift' AND
>>> p.translation_md5_id = t.translation_md5_id AND
>>> p.analysis_attrib_id = a.attrib_id;
>>>
>>>+----------------------------------+-------+-----------------------------+
>>>| translation_md5                  | value | length(p.prediction_matrix) |
>>>+----------------------------------+-------+-----------------------------+
>>>| 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16698 |
>>>+----------------------------------+-------+-----------------------------+
>>>
>>>On our system, however, the size of the prediction matrix is smaller:
>>>
>>>+----------------------------------+-------+-----------------------------+
>>>| translation_md5                  | value | length(p.prediction_matrix) |
>>>+----------------------------------+-------+-----------------------------+
>>>| 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16631 |
>>>+----------------------------------+-------+-----------------------------+
>>>
>>>We have locally saved the blobs from both databases by patching the Perl
>>>API. These are files that are zlib-compressed and start with the magic
>>>number "VEP" after uncompressing them. It turns out that the original
>>>Ensembl blob can be uncompressed with the zcat command, whereas there are
>>>error messages when trying to zcat our blob from the local database (CRC
>>>checksum error and file size error reported by zcat).
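The zcat check described above can also be scripted. The following is only a sketch under assumptions: the function name inspect_blob and the sample byte strings are invented for illustration, and the blob is assumed to be available as a Python bytes object (for example read from a saved file). Passing wbits = MAX_WBITS | 32 asks zlib to accept either a gzip or a zlib header, since the text calls the blobs zlib-compressed but zcat expects gzip framing.

```python
import gzip
import zlib


def inspect_blob(blob: bytes) -> str:
    """Classify a stored matrix blob (hypothetical helper, not part of the Ensembl API)."""
    try:
        # MAX_WBITS | 32 lets zlib auto-detect a gzip or zlib header.
        raw = zlib.decompress(blob, wbits=zlib.MAX_WBITS | 32)
    except zlib.error as exc:
        return f"corrupt: {exc}"
    if raw.startswith(b"VEP"):
        return "ok"
    return "decompressed, but magic number is not VEP"


# Made-up stand-ins for a saved blob; real blobs would come from the database.
good = gzip.compress(b"VEP" + bytes(100))
truncated = good[:-5]

print(inspect_blob(good))
print(inspect_blob(truncated))
```

A blob that fails here would also be expected to fail in zcat, although the exact error text differs between the two tools.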
>>>
>>>We then did a binary diff of the hexdump of both compressed blobs stored as
>>>files (xxd and vimdiff). The *only* difference between our local blob and
>>>the Ensembl blob is that ours is missing *all* 0x0d characters in the blob.
>>>[see attached figure vimdiff-ing the hexdumps of the blobs as they are
>>>stored in the database: left column, from public Ensembl database; right
>>>column, from our local installation.] This also explains the smaller blob
>>>size in our installation.
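The same comparison can be done without xxd and vimdiff. This is a small sketch with invented helper names and tiny made-up byte strings in place of the real blobs. If missing 0x0d bytes are the only difference, the public blob should contain exactly 16698 - 16631 = 67 of them, and removing them should reproduce the local blob byte for byte.

```python
def missing_only_cr(reference: bytes, local: bytes) -> bool:
    """True if `local` equals `reference` with every 0x0d byte removed."""
    return local != reference and reference.replace(b"\r", b"") == local


def cr_count(blob: bytes) -> int:
    """Number of 0x0d bytes in the blob."""
    return blob.count(b"\r")


# Tiny made-up byte strings standing in for the two saved blobs.
reference = b"\x1f\x8b\x0d\x00\x42\x0d\x0a"
local = b"\x1f\x8b\x00\x42\x0a"

print("only 0x0d missing:", missing_only_cr(reference, local))
print("0x0d bytes in reference:", cr_count(reference))
print("size difference:", len(reference) - len(local))
```

For the real blobs one would read both saved files in binary mode and pass their contents to the two functions.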
>>>
>>>The data were imported in the database exactly as described here:
>>>https://www.ensembl.org/info/docs/webcode/mirror/install/ensembl-data.html
>>>and the checksums were correctly verified with the "sum" utility.
>>>
>>>We discovered that up to release 98 queries to our internal installation
>>>and to the public ensembldb.ensembl.org give identical results. Starting
>>>with release 99, the results differ and we are getting the errors described
>>>above.
>>>
>>>We are using MariaDB release 10.3.23 on Debian Bullseye (Testing), but we
>>>also tried with a fresh MySQL installation from
>>>https://dev.mysql.com/downloads/mysql/. We also tried using the "LOAD DATA
>>>INFILE" instead of mysqlimport, always with the same results.
>>>
>>>Did anybody have similar experiences? Has the development team changed
>>>anything in the format of the MySQL database dumps that affects blob
>>>encoding? Might this be a pointer to a platform-specific issue, as only the
>>>carriage return character '\r' (0x0d) is affected?
>>>
>>>Any help is highly appreciated. Thanks for sharing your thoughts on this.
>>>
>>>Chris and Daniele - EURAC research, Institute for Biomedicine
>>>
>>>_______________________________________________
>>>Dev mailing list Dev at ensembl.org
>>>Posting guidelines and subscribe/unsubscribe info:
>>>https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>>>Ensembl Blog: http://www.ensembl.info/
>>>
>