[ensembl-dev] Suggestions re: Ensembl core schema
Andy Yates
ayates at ebi.ac.uk
Fri Oct 5 09:53:33 BST 2012
Hi Ken,
Thanks for the comments and bringing a lot of this to our attention. We will discuss this internally at the next available opportunity but would like to address some of the issues & points you've raised already. Just for the record, as I'm sure you know, we do not support any ORM access to the schemas other than the Ensembl API. I am sure you have your own reasons to use DBIx::Class but our primary concern is to keep the API functioning. Changes in schemas will mean knock on API changes.
To address your original points:
1). We do not create unnecessary surrogate keys; they are normally created to id an entity. Some tables do not have have their own ID such as analysis_description which is keyed off an analysis_id or an auto-incrementing ID is used to ID a group without a parent table (such as alt_allele). InnoDB internally will create a surrogate primary key for its own house keeping but Ensembl only supports MyISAM as a target schema. This does *not* mean people cannot use InnoDB (or any other MySQL storage engine such as XtraDB) but the support is limited.
2). Thanks for highlighting foreign key issues. We have found a few mis-matches (data file and dependent xrefs do not have the right data types defined) and missing indexes (mapping set) which will be fixed for 70. mapping_set is being redone anyway so we will ensure the new structure is compliant with the other available foreign keys
3). Data types have been chosen for a reason. To override all for the sake of it would be a costly patch & really not required IMHO. We will review the current data type usage (and at the very least ensure foreign keys match) & ensure that future data type choices are made for good reasons
As for the issue with intron_supporting_evidence; are you sure you are on the latest schema? I've just taken a look at the table.sql which shipped with 68 (and will ship with 69) looks like:
CREATE TABLE intron_supporting_evidence (
intron_supporting_evidence_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
analysis_id SMALLINT UNSIGNED NOT NULL,
seq_region_id INT(10) UNSIGNED NOT NULL,
seq_region_start INT(10) UNSIGNED NOT NULL,
seq_region_end INT(10) UNSIGNED NOT NULL,
seq_region_strand TINYINT(2) NOT NULL,
hit_name VARCHAR(100) NOT NULL,
score DECIMAL(10,3),
score_type ENUM('NONE', 'DEPTH') DEFAULT 'NONE',
is_splice_canonical BOOLEAN NOT NULL DEFAULT 0,
PRIMARY KEY (intron_supporting_evidence_id),
UNIQUE KEY (analysis_id, seq_region_id, seq_region_start, seq_region_end, seq_region_strand, hit_name),
KEY seq_region_idx (seq_region_id, seq_region_start)
) COLLATE=latin1_swedish_ci ENGINE=MyISAM;
Lastly InnoDB is not used due to operational reasons such as:
* InnoDB does not support transportable data files which are vital to our ability to move data efficiently between servers during our release production
* MyISAM shows far better performance when issuing LOAD DATA INFILE due to InnoDB's transactional enforcement
* Traditionally MyISAM has provided better read speed than InnoDB especially when the tables are fixed width.
I hope this answers your questions
Andy
Andrew Yates Ensembl Core Software Project Leader
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensembl.org/
On 3 Oct 2012, at 17:44, Youens-Clark, Ken wrote:
> Following up on my suggestion yesterday to add primary keys to those tables missing them and to change all INTEGER types to be the same, I attach herewith a diff to do just that. Committing these changes should have no adverse effect on the schema or code as this is only adding functionality.
>
> Also, I found this in the "table.sql" file:
>
>> CREATE TABLE mapping_set (
>>
>> mapping_set_id INTEGER UNSIGNED NOT NULL,
>> schema_build VARCHAR(20) NOT NULL,
>>
>> PRIMARY KEY(schema_build)
>>
>> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
>
> I believe that must be wrong -- the PK should be the "mapping_set_id," right?
>
> Also, the "foreign_keys.sql" file has this:
>
>> ALTER table intron_supporting_evidence ADD FOREIGN KEY (analysis_id) REFERENCES analysis(analysis_id);
>> ALTER table intron_supporting_evidence ADD FOREIGN KEY (seq_region_id) REFERENCES seq_region(seq_region_id);
>
>
> But the "intron_supporting_evidence" table doesn't have those fields:
>
>> CREATE TABLE intron_supporting_evidence (
>> intron_supporting_evidence_id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
>> previous_exon_id INTEGER UNSIGNED NOT NULL,
>> next_exon_id INTEGER UNSIGNED NOT NULL,
>> hit_name VARCHAR(100) NOT NULL,
>> score DECIMAL(10,3),
>> score_type ENUM('NONE', 'DEPTH') DEFAULT 'NONE',
>>
>> PRIMARY KEY (intron_supporting_evidence_id),
>>
>> UNIQUE KEY (previous_exon_id, next_exon_id)
>> ) COLLATE=latin1_swedish_ci ENGINE=MyISAM;
>
> I will also attach a cleaned up version of the entire "table.sql" that has everything consistently formatted as it might prove easier to work with. Lastly, I will attach an InnoDB version that MySQL can actually use to create a version with referential integrity. I'd like to ask at this point why InnoDB isn't used?
>
> --
> Ken Youens-Clark
> kclark at cshl.edu
> Ware Lab/CSHL/USDA-ARS
>
> <fk-diff.txt><table-innodb.sql><table.sql.format><table-diff.txt>_______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
More information about the Dev
mailing list