[ensembl-dev] Patching an Ensembl compare database from 67 to 74

Jan Vogel jan.vogel at gmail.com
Fri Jan 24 01:00:36 GMT 2014


Ah, I forgot there was another little problem with the sequence_cds  table  - attached the SQL i ran. 

Maybe it's worth editing the patch files … 


Regards,
   Jan 


After fixing this, run the patches from the patch_67_68.sql file by hand. Next patch which fails:

     ALTER TABLE sequence_cds
     DROP PRIMARY KEY,
     DROP KEY member_id,
     DROP COLUMN sequence_cds_id,
     ADD PRIMARY KEY (member_id); 

     ERROR 1062 (23000): Duplicate entry '2012449' for key 'PRIMARY'

Fix the sequence_cds table:


Check first if all is really duplicated:

    igisdev -s -D ensembl_compara_67 -e'select member_id from sequence_cds GROUP BY member_id HAVING COUNT(*) > 1' > member_ids_duplicated.txt

    gawk '{ print "select member_id, length, sequence_cds  from sequence_cds where member_id ="$1";" }' member_ids_duplicated.txt > member_ids.sql  

    igisdev -s -D ensembl_compara_67 < member_ids.sql > tmp.txt 

    wc -l tmp.txt  

    sort -u tmp.txt | wc -l  

Result : 5622 lines in the duplicated file, 2811 lines in the unique file => all entries are duplicates, so we can delete the
duplicated items.


      select member_id , count(*) from sequence_cds GROUP BY member_id HAVING COUNT(*) > 1
      order by count(*) desc limit 3;

      create table tmp_s_cds (sequence_cds_id int) ;  

      INSERT into tmp_s_cds 
      SELECT sequence_cds_id 
      FROM sequence_cds GROUP BY member_id HAVING COUNT(*) > 1;

      select count(*) from tmp_s_cds ; 

Check: count(*) =  2811  , half of 5622 = OK !

      DELETE FROM sequence_cds 
      USING tmp_s_cds , sequence_cds 
      WHERE tmp_s_cds.sequence_cds_id = sequence_cds.sequence_cds_id;

Check:

      SELECT * from  tmp_s_cds , sequence_cds
      WHERE tmp_s_cds.sequence_cds_id = sequence_cds.sequence_cds_id ; 

      DROP table tmp_seb_id; 

Now, try to apply the patch again:

      ALTER TABLE sequence_cds
      DROP PRIMARY KEY,
      DROP KEY member_id,
      DROP COLUMN sequence_cds_id,
      ADD PRIMARY KEY (member_id); 



On Jan 23, 2014, at 4:56 PM, Jan Vogel <jan.vogel at gmail.com> wrote:

> Hello Matthieu et al, 
> 
> thanks for the comments - I managed to patch the database, but used a slightly different SQL. 
> 
> Also thanks to the compara team and ensembl to supply these patch files - we're computing our own gene sets, however we're patching all in-house databases up to the most recent schema so we can take advantage of all the bug fixes and improvements. 
> 
> Regards, 
> 
>     Jan 
> 
> 
> 
> Here's my log, if someone is interested : 
> 
> Schema patch fails due to duplicated member_ids in the sequence_exon_bounded table:
> 
>     ALTER TABLE sequence_exon_bounded 
>     DROP PRIMARY KEY, DROP KEY member_id, 
>     DROP COLUMN sequence_exon_bounded_id, 
>     ADD PRIMARY KEY (member_id)
> 
> 
> There are some member_id duplicates in the table ... let's fix this - details in IGIS
> 
> Identify the max. number of duplications in the member_id group:
> 
>     SELECT sequence_exon_bounded_id , count(*)
>     FROM sequence_exon_bounded GROUP BY member_id HAVING COUNT(*) > 1
>     order by count(*) desc limit 3;
> 
> Create tmp table to store the sequence_exon_bounded_id identifiers for deletion:
> 
>     create table tmp_seb_id (sequence_exon_bounded_id int) ;  
> 
>     INSERT into tmp_seb_id 
>     SELECT sequence_exon_bounded_id 
>     FROM sequence_exon_bounded GROUP BY member_id HAVING COUNT(*) > 1;
> 
>     DELETE FROM seb 
>     USING tmp_seb_id tmp_seb, sequence_exon_bounded seb 
>     WHERE tmp_seb.sequence_exon_bounded_id = seb.sequence_exon_bounded_id;
> 
> Check:
> 
>     SELECT * from  tmp_seb_id tmp_seb, sequence_exon_bounded seb 
>     WHERE tmp_seb.sequence_exon_bounded_id = seb.sequence_exon_bounded_id ; 
> 
>     DROP table tmp_seb_id; 
> 
> 
> 
> On Jan 17, 2014, at 9:49 AM, Matthieu Muffato <muffato at ebi.ac.uk> wrote:
> 
>> Dear Jan,
>> 
>> There should be a single "exon bounded" sequence for each member_id and the API would anyway select only one of them.
>> You can remove the duplicated entries with this query:
>> 
>> DELETE FROM sequence_exon_bounded WHERE sequence_exon_bounded_id IN (SELECT sequence_exon_bounded_id FROM sequence_exon_bounded GROUP BY member_id HAVING COUNT(*) > 1);
>> 
>> You may have to run it several times if some member_ids have 3 or more sequences
>> 
>> 
>> Regarding Will's comment: Compara indeed only started distributing the patches as individual patches with the according meta keys a few releases ago.
>> 
>> Best,
>> Matthieu
>> 
>> On 17/01/14 10:08, Will Chow wrote:
>>> Hi Jan
>>> 
>>> I noticed warnings as well patching from 66->73 for compara for a few
>>> tables ( I don't quite remember which ones ).
>>> 
>>> Of course this doesn't really help you or solve your issues, but since
>>> our group doesn't use those tables I just ignored the patch warnings and
>>> for our purposes, it seems to work.   I do notice in the meta table the
>>> patch keys for the updates are not there below e70 even though the
>>> patches are in the sql directory.
>>> 
>>> again doesn't really help you I guess.
>>> 
>>> Will
>>> 
>>> On 17 Jan 2014, at 06:49, Jan Vogel <jan.vogel at gmail.com
>>> <mailto:jan.vogel at gmail.com>> wrote:
>>> 
>>>> 
>>>> Hi there,
>>>> 
>>>> i'm trying to patch our ensembl 67 compara database up to schema 74.
>>>> 
>>>> It fails :-( when applying the
>>>> *ensembl-compara/sql/patch_67_68.sql* patch as in the 67 schema there
>>>> are duplicate member ids in the sequence_exon_bounded table. Any idea
>>>> how to fix this and patch my e67 up to e68 ?
>>>> 
>>>> Thanks,
>>>> 
>>>>   Jan Vogel
>>>> 
>>>> 
>>>> ERROR 1062 (23000) at line 28 in file:
>>>> '/gnet/resfiles3/resapps/apps/ensembl/ensembl-74/ensembl-compara/sql/patch_67_68.sql':
>>>> Duplicate entry '2012449' for key 'PRIMARY' :
>>>> 
>>>> --------------
>>>> SET session sql_mode='TRADITIONAL'
>>>> --------------
>>>> ……
>>>> …..
>>>> 
>>>> 
>>>> --------------
>>>> ALTER TABLE sequence_exon_bounded
>>>> DROP PRIMARY KEY, DROP KEY member_id,
>>>> DROP COLUMN sequence_exon_bounded_id,
>>>> ADD PRIMARY KEY (member_id)
>>>> --------------
>>>> 
>>>> Failed to apply patch 'patch_67_68.sql' to database 'ensembl_compara_67'!
>>>> In non-interative mode; aborting current run at schema_patcher.pl line
>>>> 643.
>>>> 
>>>> 
>>>> Looking into the sequence_exon_bounded table, it looks that some
>>>> member ids are duplicated:
>>>> 
>>>> mysql> select count(distinct(member_id))  from sequence_exon_bounded
>>>> limit 10 ;
>>>> +----------------------------+
>>>> | count(distinct(member_id)) |
>>>> +----------------------------+
>>>> |                    1371152 |
>>>> +----------------------------+
>>>> 1 row in set (10.55 sec)
>>>> 
>>>> mysql> select count(member_id)  from sequence_exon_bounded limit 10 ;
>>>> +------------------+
>>>> | count(member_id) |
>>>> +------------------+
>>>> |          1373719 |
>>>> +------------------+
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>> 
>> 
>> 
>> -- 
>> Matthieu Muffato, Ph.D.
>> Ensembl Developer and Ensembl Compara Manager
>> European Bioinformatics Institute (EMBL-EBI)
>> European Molecular Biology Laboratory
>> Wellcome Trust Genome Campus, Hinxton
>> Cambridge, CB10 1SD, United Kingdom
>> Room  A3-145
>> Phone + 44 (0) 1223 49 4631
>> Fax   + 44 (0) 1223 49 4468
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20140123/62d30a2b/attachment.html>


More information about the Dev mailing list