[ensembl-dev] Corrupted prediction matrix in local installation from Ensembl version 99 on
James Allen
jallen at ebi.ac.uk
Fri Sep 25 17:21:06 BST 2020
Hello,
The MySQL variation dump files have now been corrected for all species, across
all the Ensembl divisions (i.e. ftp://ftp.ensembl.org/pub and
ftp://ftp.ensemblgenomes.org/pub), for Ensembl releases 99, 100, 101 and Ensembl
Genomes releases 46, 47, 48.
This fix corrects the compressed data in the dump files of four tables:
* compressed_genotype_region
* compressed_genotype_var
* protein_function_predictions
* protein_function_predictions_attrib
Dump files for other tables are unchanged.
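For anyone refreshing a local mirror, only these four dump files should need to be downloaded again. A minimal Python sketch that builds the download URLs, assuming each table is published as <table>.txt.gz inside the directory named after the database (the usual layout of the mysql FTP directories, but an assumption here, as is the helper name affected_dump_urls):

```python
# Tables whose dump files were regenerated, as listed above.
AFFECTED_TABLES = [
    "compressed_genotype_region",
    "compressed_genotype_var",
    "protein_function_predictions",
    "protein_function_predictions_attrib",
]


def affected_dump_urls(base_url, database):
    """Return one URL per affected table for a given database directory.

    Assumption: dump files are named <table>.txt.gz inside the directory
    named after the database.
    """
    base = base_url.rstrip("/")
    return [f"{base}/{database}/{table}.txt.gz" for table in AFFECTED_TABLES]


if __name__ == "__main__":
    # Directory taken from the human release-100 link quoted in this thread.
    for url in affected_dump_urls(
        "ftp://ftp.ensembl.org/pub/release-100/mysql",
        "homo_sapiens_variation_100_38",
    ):
        print(url)
```

The same function can be pointed at any of the database prefixes in this message by changing the base URL and database name.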
The database prefixes for the affected species are listed below.
Regards,
James Allen
Ensembl Production team
Vertebrates
* bos_taurus
* canis_lupus_familiaris
* capra_hircus
* danio_rerio
* equus_caballus
* felis_catus
* gallus_gallus
* macaca_mulatta
* mus_musculus
* nomascus_leucogenys
* ornithorhynchus_anatinus
* ovis_aries
* pan_troglodytes
* pongo_abelii
* rattus_norvegicus
* saccharomyces_cerevisiae
* sus_scrofa
* taeniopygia_guttata
Fungi
* fusarium_oxysporum
* puccinia_graminis
* puccinia_graminisug99
* saccharomyces_cerevisiae
* schizosaccharomyces_pombe
* verticillium_dahliaejr2
* zymoseptoria_tritici
Protists
* phaeodactylum_tricornutum
* phytophthora_infestans
Metazoa
* aedes_aegypti_lvpagwg
* anopheles_arabiensis
* anopheles_culicifacies
* anopheles_epiroticus
* anopheles_farauti
* anopheles_funestus
* anopheles_gambiae
* anopheles_melas
* anopheles_merus
* anopheles_minimus
* anopheles_quadriannulatus
* anopheles_sinensis
* anopheles_stephensi_indian
* anopheles_stephensi
* biomphalaria_glabrata
* culex_quinquefasciatus
* ixodes_scapularis
* lutzomyia_longipalpis
* phlebotomus_papatasi
Plants
* arabidopsis_thaliana
* brachypodium_distachyon
* helianthus_annuus
* hordeum_vulgare
* oryza_glaberrima
* oryza_glumipatula
* oryza_indica
* oryza_sativa
* solanum_lycopersicum
* sorghum_bicolor
* triticum_aestivum
* triticum_turgidum
* vitis_vinifera
* zea_mays
On 21/09/2020 17:18, James Allen wrote:
> Hello,
> Thanks for raising this issue, and for the detailed diagnostics. There was an
> error in the dumping process that created the protein_function_predictions table
> (and, incidentally, the protein_function_predictions_attrib table), whereby '\r'
> characters were erroneously removed.
>
> We have recreated the dump files for the human database, for releases 99, 100,
> and 101 (links below). Similar problems exist for some other species, which will
> be fixed later this week (we'll confirm on this list when this is done, and
> for which species).
>
> Apologies for the inconvenience. Hopefully this resolves your problems, but
> please do get in touch if not.
>
> Cheers,
> James
> Ensembl Production team
>
>
> ftp://ftp.ensembl.org/pub/release-99/mysql/homo_sapiens_variation_99_38/
> ftp://ftp.ensembl.org/pub/release-100/mysql/homo_sapiens_variation_100_38/
> ftp://ftp.ensembl.org/pub/release-101/mysql/homo_sapiens_variation_101_38/
>
>
> On 16/09/2020 16:20, Chris Weichenberger wrote:
>> Dear all,
>>
>> since Ensembl release 99 we have been facing errors when using SIFT and
>> PolyPhen scores from the variation database on a local installation of the
>> human Ensembl database (GRCh38). Our program uses the Ensembl Perl API, and
>> when computing SIFT scores it fails with the following error message (this
>> example uses Ensembl API version 100):
>>
>> -------------------- EXCEPTION --------------------
>> MSG: Failed to gunzip:
>> STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::expand_matrix
>> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:654
>>
>> STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::prediction_from_matrix
>> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:699
>>
>> STACK Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix::get_prediction
>> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/ProteinFunctionPredictionMatrix.pm:349
>>
>> STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::_protein_function_prediction
>> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1227
>>
>> STACK Bio::EnsEMBL::Variation::TranscriptVariationAllele::sift_score
>> /usr/local/stow/ensembl-api-100/lib/site_perl/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm:1021
>>
>> [removed: the rest of the stack originating from our software]
>>
>> When we connect to the public Ensembl database via the Perl API, the
>> program works as expected and computes the scores. Debugging this problem
>> revealed that the binary data for the prediction matrix stored in our
>> database is different to what is stored in the official Ensembl database.
>>
>> $ mysql -h ensembldb.ensembl.org -u anonymous
>>
>> mysql> use homo_sapiens_variation_100_38
>> mysql> SELECT t.translation_md5, a.value, length(p.prediction_matrix)
>> FROM (protein_function_predictions p, translation_md5 t, attrib a)
>> WHERE t.translation_md5 = '6cfea960ade30a16e3f138d55d1eaf03' AND
>> a.value = 'sift' AND
>> p.translation_md5_id = t.translation_md5_id AND
>> p.analysis_attrib_id = a.attrib_id;
>>
>> +----------------------------------+-------+-----------------------------+
>> | translation_md5                  | value | length(p.prediction_matrix) |
>> +----------------------------------+-------+-----------------------------+
>> | 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16698 |
>> +----------------------------------+-------+-----------------------------+
>>
>> On our system, however, the size of the prediction matrix is smaller:
>>
>> +----------------------------------+-------+-----------------------------+
>> | translation_md5                  | value | length(p.prediction_matrix) |
>> +----------------------------------+-------+-----------------------------+
>> | 6cfea960ade30a16e3f138d55d1eaf03 | sift  |                       16631 |
>> +----------------------------------+-------+-----------------------------+
>>
>> We have locally saved the blobs from both databases by patching the Perl
>> API. These are files that are zlib-compressed and start with the magic
>> number "VEP" after uncompressing them. It turns out that the original
>> Ensembl blob can be uncompressed with the zcat command, whereas there are
>> error messages when trying to zcat our blob from the local database (CRC
>> checksum error and file size error reported by zcat).
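The same check can be scripted. A minimal Python sketch, assuming the saved blobs are in gzip format (consistent with zcat reading the intact one) and that the uncompressed payload starts with the bytes "VEP" as described above. check_blob is a hypothetical helper, and the demo payload is synthetic, not real Ensembl data:

```python
import gzip
import random
import zlib


def check_blob(blob):
    """Return (ok, detail) for a gzip-compressed prediction-matrix blob."""
    try:
        payload = gzip.decompress(blob)
    except (OSError, EOFError, zlib.error) as exc:
        # Bad header, CRC/length mismatch, truncated or corrupt stream.
        return False, f"gunzip failed: {exc}"
    if not payload.startswith(b"VEP"):
        return False, "decompressed, but the VEP magic number is missing"
    return True, f"ok, {len(payload)} bytes uncompressed"


if __name__ == "__main__":
    # Synthetic stand-in for a real blob: magic number plus pseudo-random bytes.
    rng = random.Random(0)
    payload = b"VEP" + bytes(rng.getrandbits(8) for _ in range(4096))
    intact = gzip.compress(payload, mtime=0)
    # Mimic the faulty dump by dropping every 0x0d byte.
    damaged = intact.replace(b"\r", b"")
    print(check_blob(intact))
    print(check_blob(damaged))
```

With real data, the blob would be read from the saved file in binary mode and passed to check_blob unchanged.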
>>
>> We then did a binary diff of the hexdump of both compressed blobs stored as
>> files (xxd and vimdiff). The *only* difference between our local blob and
>> the Ensembl blob is that ours is missing *all* 0x0d characters in the blob.
>> [see attached figure vimdiff-ing the hexdumps of the blobs as they are
>> stored in the database: left column, from public Ensembl database; right
>> column, from our local installation.] This also explains the smaller blob
>> size in our installation.
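This comparison can also be automated instead of inspected by eye. A rough Python sketch, assuming both blobs are available as raw bytes; missing_cr_report is a hypothetical helper name, and the demo bytes are made up for illustration:

```python
def missing_cr_report(reference, local):
    """Check whether `local` equals `reference` with every 0x0d byte removed."""
    cr_count = reference.count(b"\r")
    return {
        "cr_bytes_in_reference": cr_count,
        "length_difference": len(reference) - len(local),
        "explained_by_cr_removal": local == reference.replace(b"\r", b""),
    }


if __name__ == "__main__":
    # Made-up bytes; with real data, read both files in binary mode ("rb").
    reference = b"\x1f\x8b\x08\x0d\x00\x41\x0d\x42"
    local = b"\x1f\x8b\x08\x00\x41\x42"
    print(missing_cr_report(reference, local))
```

For the SIFT matrix queried earlier, the two reported lengths differ by 67 bytes (16698 versus 16631), so if that was the blob being compared, the public copy would be expected to contain 67 bytes with value 0x0d.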
>>
>> The data were imported in the database exactly as described here:
>> https://www.ensembl.org/info/docs/webcode/mirror/install/ensembl-data.html
>> and the checksums were correctly verified with the "sum" utility.
>>
>> We discovered that up to release 98 queries to our internal installation
>> and to the public ensembldb.ensembl.org give identical results. Starting
>> with release 99, the results differ and we are getting the errors described
>> above.
>>
>> We are using MariaDB release 10.3.23 on Debian Bullseye (Testing), but we
>> also tried with a fresh MySQL installation from
>> https://dev.mysql.com/downloads/mysql/. We also tried using "LOAD DATA
>> INFILE" instead of mysqlimport, always with the same results.
>>
>> Did anybody have similar experiences? Has the development team changed
>> anything in the format of the MySQL database dumps that affects blob
>> encoding? Might this be a pointer to a platform-specific issue, as only the
>> carriage return character '\r' (0x0d) is affected?
>>
>> Any help is highly appreciated. Thanks for sharing your thoughts on this.
>>
>> Chris and Daniele - EURAC research, Institute for Biomedicine
>>
>> _______________________________________________
>> Dev mailing list Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>> Ensembl Blog: http://www.ensembl.info/
>>