[ensembl-dev] Perl BLOBs in the ensembl compara database

PATERSON Trevor trevor.paterson at roslin.ed.ac.uk
Fri Jan 21 16:42:10 GMT 2011


Thanks Andy(s)

I may play about with trying to translate the BLOBS next week.

As you point out, the actual type/format of the BLOB is determined by the Perl API itself when the data is stored.

If this does turn out to be OS-specific, it could be unpleasant to write robust Java code that manages to retrieve floats from the BLOBs on any platform.
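One way to hedge against a platform-dependent byte order is to decode a BLOB both ways and keep the reading that looks more like real scores. The following is only a sketch under stated assumptions: the class ByteOrderGuess and its methods are hypothetical names, the values are assumed to be 4-byte IEEE 754 floats (as the "f" pack template would suggest), and the magnitude bounds of 1e-6 to 1e6 are arbitrary illustrative limits, not anything taken from the Compara API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: guesses the byte order of a BLOB of packed 32-bit floats. */
final class ByteOrderGuess {

    private ByteOrderGuess() {}

    /**
     * Counts values that are finite and either zero or within [1e-6, 1e6] in magnitude.
     * The bounds are an assumption chosen for illustration only.
     */
    static int plausibleCount(byte[] blob, ByteOrder order) {
        ByteBuffer buffer = ByteBuffer.wrap(blob).order(order);
        int count = 0;
        while (buffer.remaining() >= Float.BYTES) {
            float value = buffer.getFloat();
            float magnitude = Math.abs(value);
            boolean finite = !Float.isNaN(value) && !Float.isInfinite(value);
            if (finite && (magnitude == 0f || (magnitude >= 1e-6f && magnitude <= 1e6f))) {
                count++;
            }
        }
        return count;
    }

    /** Returns the byte order with more plausible values; ties fall back to little-endian. */
    static ByteOrder guess(byte[] blob) {
        int little = plausibleCount(blob, ByteOrder.LITTLE_ENDIAN);
        int big = plausibleCount(blob, ByteOrder.BIG_ENDIAN);
        return big > little ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN;
    }
}
```

A heuristic like this cannot tell the two orders apart for a BLOB of all zeros, and it says nothing about the width of each value, so it would be a sanity check rather than a substitute for knowing how the data was written.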

And as the datatype is not determined by the database schema but by the Perl API, maintaining Java access code then relies on keeping abreast of changes within the internals of the Perl API as well as following any schema evolution.

(I'm not liking it :)

trevor

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


-----Original Message-----
From: Andy Jenkinson [mailto:andy.jenkinson at ebi.ac.uk] 
Sent: 21 January 2011 15:46
To: PATERSON Trevor
Subject: Re: [ensembl-dev] Perl BLOBs in the ensembl compara database

Hi Trevor,

I'm pretty sure the blob will just be a binary representation of a series of floating point numbers, laid out according to the machine used to pack them. That is, it's unlikely to actually be some sort of binary representation of a special Perl data structure (Perl floats aren't objects, remember). That is, every 64 bits will be a little-endian float (probably). Just unpack them in Java and see what happens.
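As a rough sketch of what "unpack them in Java" could look like: the class PackedFloats and its decode method below are hypothetical names, and the 4-byte width is an assumption based on the single-precision "f" template discussed elsewhere in this thread (a BLOB written with "d" would need 8-byte doubles instead). Note that a Java ByteBuffer defaults to big-endian whatever the host machine is, so the byte order has to be set explicitly.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: decodes a BLOB assumed to hold packed 32-bit IEEE 754 floats. */
final class PackedFloats {

    private PackedFloats() {}

    /**
     * Interprets the bytes as consecutive 4-byte floats in the given byte order.
     * Assumption: the writer used pack("f*") on a machine with 4-byte IEEE 754 floats.
     */
    static float[] decode(byte[] blob, ByteOrder order) {
        if (blob.length % Float.BYTES != 0) {
            throw new IllegalArgumentException(
                "BLOB length " + blob.length + " is not a multiple of " + Float.BYTES);
        }
        // ByteBuffer is big-endian by default, so set the order before creating the float view.
        ByteBuffer buffer = ByteBuffer.wrap(blob).order(order);
        float[] values = new float[blob.length / Float.BYTES];
        buffer.asFloatBuffer().get(values);
        return values;
    }
}
```

With a byte[] obtained from, say, ResultSet.getBytes("diff_score"), a call such as PackedFloats.decode(bytes, ByteOrder.LITTLE_ENDIAN) would return the scores as a float array, provided the width and byte order assumptions hold.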

Cheers,
Andy

On 21 Jan 2011, at 15:29, Andy Yates wrote:

> Hi Trevor,
> 
> There are quite a few instances of BLOB & pack/unpack usage. A quick search of the main APIs shows up hits in Compara, functional genomics and variation. The thing to note here is what the Perl documentation says about the f option in pack:
> 
> A single-precision float in native format.
> 
> This could mean native format as in a Perl native format (unlikely, as the F option explicitly mentions Perl), so that would leave us with native format meaning the byte order (endianness) of the machine. You would have to guess that little-endian is going to be the answer there.
> 
> Have you tried to decode these BLOBs to Java floats yet? Also what 
> machine type are you on? You may find an issue if you're doing this on 
> a Windows box and will have to force the byte order to little-endian 
> (which in Java means using an nio ByteBuffer IIRC)
> 
> Andy
> 
> On 21 Jan 2011, at 14:28, PATERSON Trevor wrote:
> 
>> I have been trying to get to grips with the Compara schema in order to think about writing Java libraries to access the data...
>> 
>> However it appears that some of the data in Compara is more intimately wedded to Perl than I had hoped!
>> 
>> Looking at the Genomic Alignment data, the Conservation Score values (which I think can be a variable-length array of floats) are stored as BLOBs, i.e. packed internal representations of Perl floats, and therefore require Perl to unpack them.
>> 
>> Quickly scanning through the schema, I don't see any other fields of type BLOB.
>> 
>> My understanding is that these values are probably dumped here using 'pack' as a quick 'hack' to avoid having to deal with variable length arrays.
>> 
>> Unfortunately it does, however, rather tie the data to Perl.
>> 
>> Is this a design decision or just a historical accident?
>> Are there (or will there be) any other examples of Perl BLOBs in Ensembl?
>> 
>> 
>> cheers
>> 
>> Trevor
>> 
>> 
>> 
>> mysql> describe conservation_score;
>> +------------------------+----------------------+------+-----+---------+-------+
>> | Field                  | Type                 | Null | Key | Default | Extra |
>> +------------------------+----------------------+------+-----+---------+-------+
>> | genomic_align_block_id | bigint(20) unsigned  | NO   | MUL | NULL    |       |
>> | window_size            | smallint(5) unsigned | NO   |     | NULL    |       |
>> | position               | int(10) unsigned     | NO   |     | NULL    |       |
>> | expected_score         | blob                 | YES  |     | NULL    |       |
>> | diff_score             | blob                 | YES  |     | NULL    |       |
>> +------------------------+----------------------+------+-----+---------+-------+
>> 
>> 
>> 
>> Trevor Paterson PhD
>> email trevor.paterson at roslin.ed.ac.uk
>> 
>> Bioinformatics
>> The Roslin Institute
>> The Royal (Dick) School of Veterinary Studies University of Edinburgh 
>> Scotland EH25 9PS phone +44 (0)131 5274197 
>> http://bioinformatics.roslin.ed.ac.uk/
>> 
>> 
>> 
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
> 
> -- 
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
