[ensembl-dev] Suggestions re: Ensembl core schema

Fri Oct 5 09:53:33 BST 2012

Hi Ken,

Thanks for the comments and bringing a lot of this to our attention. We will discuss this internally at the next available opportunity but would like to address some of the issues & points you've raised already. Just for the record, as I'm sure you know, we do not support any ORM access to the schemas other than the Ensembl API. I am sure you have your own reasons to use DBIx::Class but our primary concern is to keep the API functioning. Changes in schemas will mean knock on API changes. 

To address your original points:

1). We do not create unnecessary surrogate keys; they are normally created to id an entity. Some tables do not have have their own ID such as analysis_description which is keyed off an analysis_id or an auto-incrementing ID is used to ID a group without a parent table (such as alt_allele). InnoDB internally will create a surrogate primary key for its own house keeping but Ensembl only supports MyISAM as a target schema. This does *not* mean people cannot use InnoDB (or any other MySQL storage engine such as XtraDB) but the support is limited.

2). Thanks for highlighting foreign key issues. We have found a few mis-matches (data file and dependent xrefs do not have the right data types defined) and missing indexes (mapping set) which will be fixed for 70. mapping_set is being redone anyway so we will ensure the new structure is compliant with the other available foreign keys

3). Data types have been chosen for a reason. To override all for the sake of it would be a costly patch & really not required IMHO. We will review the current data type usage (and at the very least ensure foreign keys match) & ensure that future data type choices are made for good reasons

As for the issue with intron_supporting_evidence; are you sure you are on the latest schema? I've just taken a look at the table.sql which shipped with 68 (and will ship with 69) looks like:

CREATE TABLE intron_supporting_evidence (
	intron_supporting_evidence_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
	analysis_id                   SMALLINT UNSIGNED NOT NULL,
	seq_region_id                 INT(10) UNSIGNED NOT NULL,
	seq_region_start              INT(10) UNSIGNED NOT NULL,
	seq_region_end                INT(10) UNSIGNED NOT NULL,
	seq_region_strand             TINYINT(2) NOT NULL,
	hit_name                      VARCHAR(100) NOT NULL,
	score                         DECIMAL(10,3),
	score_type                    ENUM('NONE', 'DEPTH') DEFAULT 'NONE',
	is_splice_canonical           BOOLEAN NOT NULL DEFAULT 0,

	PRIMARY KEY (intron_supporting_evidence_id),

	UNIQUE KEY (analysis_id, seq_region_id, seq_region_start, seq_region_end, seq_region_strand, hit_name),
	KEY seq_region_idx (seq_region_id, seq_region_start)

) COLLATE=latin1_swedish_ci ENGINE=MyISAM;

Lastly InnoDB is not used due to operational reasons such as:

* InnoDB does not support transportable data files which are vital to our ability to move data efficiently between servers during our release production

* MyISAM shows far better performance when issuing LOAD DATA INFILE due to InnoDB's transactional enforcement

* Traditionally MyISAM has provided better read speed than InnoDB especially when the tables are fixed width.

I hope this answers your questions

Andy

Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensembl.org/

On 3 Oct 2012, at 17:44, Youens-Clark, Ken wrote:

> Following up on my suggestion yesterday to add primary keys to those tables missing them and to change all INTEGER types to be the same, I attach herewith a diff to do just that.  Committing these changes should have no adverse effect on the schema or code as this is only adding functionality.  
> 
> Also, I found this in the "table.sql" file:
> 
>> CREATE TABLE mapping_set (
>> 
>>        mapping_set_id  INTEGER UNSIGNED NOT NULL,
>>        schema_build    VARCHAR(20) NOT NULL,
>> 
>>        PRIMARY KEY(schema_build)
>> 
>> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
> 
> I believe that must be wrong -- the PK should be the "mapping_set_id," right?
> 
> Also, the "foreign_keys.sql" file has this:
> 
>> ALTER table intron_supporting_evidence ADD FOREIGN KEY (analysis_id) REFERENCES analysis(analysis_id);
>> ALTER table intron_supporting_evidence ADD FOREIGN KEY (seq_region_id) REFERENCES seq_region(seq_region_id);
> 
> 
> But the "intron_supporting_evidence" table doesn't have those fields:
> 
>> CREATE TABLE intron_supporting_evidence (
>>  intron_supporting_evidence_id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
>>  previous_exon_id INTEGER UNSIGNED NOT NULL,
>>  next_exon_id INTEGER UNSIGNED NOT NULL,
>>  hit_name VARCHAR(100) NOT NULL,
>>  score DECIMAL(10,3),
>>  score_type ENUM('NONE', 'DEPTH') DEFAULT 'NONE',
>> 
>>  PRIMARY KEY (intron_supporting_evidence_id),
>> 
>>  UNIQUE KEY (previous_exon_id, next_exon_id)
>> ) COLLATE=latin1_swedish_ci ENGINE=MyISAM;
> 
> I will also attach a cleaned up version of the entire "table.sql" that has everything consistently formatted as it might prove easier to work with.  Lastly, I will attach an InnoDB version that MySQL can actually use to create a version with referential integrity.  I'd like to ask at this point why InnoDB isn't used?
> 
> --
> Ken Youens-Clark
> kclark at cshl.edu
> Ware Lab/CSHL/USDA-ARS
> 
> <fk-diff.txt><table-innodb.sql><table.sql.format><table-diff.txt>_______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/