[ensembl-dev] Patching an Ensembl compare database from 67 to 74

Matthieu Muffato muffato at ebi.ac.uk
Mon Jan 27 09:00:46 GMT 2014


Thank you Jan for letting us know

Matthieu

On 24/01/14 01:00, Jan Vogel wrote:
> Ah, I forgot there was another little problem with the *sequence_cds*
>   table  - attached the SQL i ran.
>
> Maybe it's worth editing the patch files …
>
>
> Regards,
>     Jan
>
>
> After fixing this, run the patches from the patch_67_68.sql file by
> hand. Next patch which fails:
>
>       ALTER TABLE sequence_cds
>       DROP PRIMARY KEY,
>       DROP KEY member_id,
>       DROP COLUMN sequence_cds_id,
>       ADD PRIMARY KEY (member_id);
>
>       ERROR 1062 (23000): Duplicate entry '2012449' for key 'PRIMARY'
>
> Fix the sequence_cds table:
>
>
> Check first if all is really duplicated:
>
>      igisdev -s -D ensembl_compara_67 -e'select member_id from
> sequence_cds GROUP BY member_id HAVING COUNT(*) > 1' >
> member_ids_duplicated.txt
>
>      gawk '{ print "select member_id, length, sequence_cds  from
> sequence_cds where member_id ="$1";" }' member_ids_duplicated.txt >
> member_ids.sql
>
>      igisdev -s -D ensembl_compara_67 < member_ids.sql > tmp.txt
>
>      wc -l tmp.txt
>
>      sort -u tmp.txt | wc -l
>
> Result : 5622 lines in the duplicated file, 2811 lines in the unique
> file => all entries are duplicates, so we can delete the
> duplicated items.
>
>
>        select member_id , count(*) from sequence_cds GROUP BY member_id
> HAVING COUNT(*) > 1
>        order by count(*) desc limit 3;
>
>        create table tmp_s_cds (sequence_cds_id int) ;
>
>        INSERT into tmp_s_cds
>        SELECT sequence_cds_id
>        FROM sequence_cds GROUP BY member_id HAVING COUNT(*) > 1;
>
>        select count(*) from tmp_s_cds ;
>
> Check: count(*) =  2811  , half of 5622 = OK !
>
>        DELETE FROM sequence_cds
>        USING tmp_s_cds , sequence_cds
>        WHERE tmp_s_cds.sequence_cds_id = sequence_cds.sequence_cds_id;
>
> Check:
>
>        SELECT * from  tmp_s_cds , sequence_cds
>        WHERE tmp_s_cds.sequence_cds_id = sequence_cds.sequence_cds_id ;
>
>        DROP table tmp_seb_id;
>
> Now, try to apply the patch again:
>
>        ALTER TABLE sequence_cds
>        DROP PRIMARY KEY,
>        DROP KEY member_id,
>        DROP COLUMN sequence_cds_id,
>        ADD PRIMARY KEY (member_id);
>
>
>
> On Jan 23, 2014, at 4:56 PM, Jan Vogel <jan.vogel at gmail.com
> <mailto:jan.vogel at gmail.com>> wrote:
>
>> Hello Matthieu et al,
>>
>> thanks for the comments - I managed to patch the database, but used a
>> slightly different SQL.
>>
>> Also thanks to the compara team and ensembl to supply these patch
>> files - we're computing our own gene sets, however we're patching all
>> in-house databases up to the most recent schema so we can take
>> advantage of all the bug fixes and improvements.
>>
>> Regards,
>>
>>     Jan
>>
>>
>>
>> Here's my log, if someone is interested :
>>
>> Schema patch fails due to duplicated member_ids in the
>> sequence_exon_bounded table:
>>
>>     ALTER TABLE sequence_exon_bounded
>>     DROP PRIMARY KEY, DROP KEY member_id,
>>     DROP COLUMN sequence_exon_bounded_id,
>>     ADD PRIMARY KEY (member_id)
>>
>>
>> There are some member_id duplicates in the table ... let's fix this -
>> details in IGIS
>>
>> Identify the max. number of duplications in the member_id group:
>>
>>     SELECT sequence_exon_bounded_id , count(*)
>>     FROM sequence_exon_bounded GROUP BY member_id HAVING COUNT(*) > 1
>>     order by count(*) desc limit 3;
>>
>> Create tmp table to store the sequence_exon_bounded_id identifiers for
>> deletion:
>>
>>     create table tmp_seb_id (sequence_exon_bounded_id int) ;
>>
>>     INSERT into tmp_seb_id
>>     SELECT sequence_exon_bounded_id
>>     FROM sequence_exon_bounded GROUP BY member_id HAVING COUNT(*) > 1;
>>
>>     DELETE FROM seb
>>     USING tmp_seb_id tmp_seb, sequence_exon_bounded seb
>>     WHERE tmp_seb.sequence_exon_bounded_id = seb.sequence_exon_bounded_id;
>>
>> Check:
>>
>>     SELECT * from  tmp_seb_id tmp_seb, sequence_exon_bounded seb
>>     WHERE tmp_seb.sequence_exon_bounded_id =
>> seb.sequence_exon_bounded_id ;
>>
>>     DROP table tmp_seb_id;
>>
>>
>>
>> On Jan 17, 2014, at 9:49 AM, Matthieu Muffato <muffato at ebi.ac.uk
>> <mailto:muffato at ebi.ac.uk>> wrote:
>>
>>> Dear Jan,
>>>
>>> There should be a single "exon bounded" sequence for each member_id
>>> and the API would anyway select only one of them.
>>> You can remove the duplicated entries with this query:
>>>
>>> DELETE FROM sequence_exon_bounded WHERE sequence_exon_bounded_id IN
>>> (SELECT sequence_exon_bounded_id FROM sequence_exon_bounded GROUP BY
>>> member_id HAVING COUNT(*) > 1);
>>>
>>> You may have to run it several times if some member_ids have 3 or
>>> more sequences
>>>
>>>
>>> Regarding Will's comment: Compara indeed only started distributing
>>> the patches as individual patches with the according meta keys a few
>>> releases ago.
>>>
>>> Best,
>>> Matthieu
>>>
>>> On 17/01/14 10:08, Will Chow wrote:
>>>> Hi Jan
>>>>
>>>> I noticed warnings as well patching from 66->73 for compara for a few
>>>> tables ( I don't quite remember which ones ).
>>>>
>>>> Of course this doesn't really help you or solve your issues, but since
>>>> our group doesn't use those tables I just ignored the patch warnings and
>>>> for our purposes, it seems to work.   I do notice in the meta table the
>>>> patch keys for the updates are not there below e70 even though the
>>>> patches are in the sql directory.
>>>>
>>>> again doesn't really help you I guess.
>>>>
>>>> Will
>>>>
>>>> On 17 Jan 2014, at 06:49, Jan Vogel <jan.vogel at gmail.com
>>>> <mailto:jan.vogel at gmail.com>
>>>> <mailto:jan.vogel at gmail.com>> wrote:
>>>>
>>>>>
>>>>> Hi there,
>>>>>
>>>>> i'm trying to patch our ensembl 67 compara database up to schema 74.
>>>>>
>>>>> It fails :-( when applying the
>>>>> *ensembl-compara/sql/patch_67_68.sql* patch as in the 67 schema there
>>>>> are duplicate member ids in the sequence_exon_bounded table. Any idea
>>>>> how to fix this and patch my e67 up to e68 ?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>   Jan Vogel
>>>>>
>>>>>
>>>>> ERROR 1062 (23000) at line 28 in file:
>>>>> '/gnet/resfiles3/resapps/apps/ensembl/ensembl-74/ensembl-compara/sql/patch_67_68.sql':
>>>>> Duplicate entry '2012449' for key 'PRIMARY' :
>>>>>
>>>>> --------------
>>>>> SET session sql_mode='TRADITIONAL'
>>>>> --------------
>>>>> ……
>>>>> …..
>>>>>
>>>>>
>>>>> --------------
>>>>> ALTER TABLE sequence_exon_bounded
>>>>> DROP PRIMARY KEY, DROP KEY member_id,
>>>>> DROP COLUMN sequence_exon_bounded_id,
>>>>> ADD PRIMARY KEY (member_id)
>>>>> --------------
>>>>>
>>>>> Failed to apply patch 'patch_67_68.sql' to database
>>>>> 'ensembl_compara_67'!
>>>>> In non-interative mode; aborting current run at schema_patcher.pl line
>>>>> 643.
>>>>>
>>>>>
>>>>> Looking into the sequence_exon_bounded table, it looks that some
>>>>> member ids are duplicated:
>>>>>
>>>>> mysql> select count(distinct(member_id))  from sequence_exon_bounded
>>>>> limit 10 ;
>>>>> +----------------------------+
>>>>> | count(distinct(member_id)) |
>>>>> +----------------------------+
>>>>> |                    1371152 |
>>>>> +----------------------------+
>>>>> 1 row in set (10.55 sec)
>>>>>
>>>>> mysql> select count(member_id)  from sequence_exon_bounded limit 10 ;
>>>>> +------------------+
>>>>> | count(member_id) |
>>>>> +------------------+
>>>>> |          1373719 |
>>>>> +------------------+
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>>> <mailto:Dev at ensembl.org>
>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>
>>>
>>> --
>>> Matthieu Muffato, Ph.D.
>>> Ensembl Developer and Ensembl Compara Manager
>>> European Bioinformatics Institute (EMBL-EBI)
>>> European Molecular Biology Laboratory
>>> Wellcome Trust Genome Campus, Hinxton
>>> Cambridge, CB10 1SD, United Kingdom
>>> Room  A3-145
>>> Phone + 44 (0) 1223 49 4631
>>> Fax   + 44 (0) 1223 49 4468
>>>
>>> _______________________________________________
>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>


-- 
Matthieu Muffato, Ph.D.
Ensembl Developer and Ensembl Compara Manager
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton
Cambridge, CB10 1SD, United Kingdom
Room  A3-145
Phone + 44 (0) 1223 49 4631
Fax   + 44 (0) 1223 49 4468




More information about the Dev mailing list