[ensembl-dev] Patching an Ensembl compare database from 67 to 74
Jan Vogel
jan.vogel at gmail.com
Fri Jan 24 01:00:36 GMT 2014
Ah, I forgot there was another little problem with the sequence_cds table - attached the SQL i ran.
Maybe it's worth editing the patch files …
Regards,
Jan
After fixing this, run the patches from the patch_67_68.sql file by hand. Next patch which fails:
ALTER TABLE sequence_cds
DROP PRIMARY KEY,
DROP KEY member_id,
DROP COLUMN sequence_cds_id,
ADD PRIMARY KEY (member_id);
ERROR 1062 (23000): Duplicate entry '2012449' for key 'PRIMARY'
Fix the sequence_cds table:
Check first if all is really duplicated:
igisdev -s -D ensembl_compara_67 -e'select member_id from sequence_cds GROUP BY member_id HAVING COUNT(*) > 1' > member_ids_duplicated.txt
gawk '{ print "select member_id, length, sequence_cds from sequence_cds where member_id ="$1";" }' member_ids_duplicated.txt > member_ids.sql
igisdev -s -D ensembl_compara_67 < member_ids.sql > tmp.txt
wc -l tmp.txt
sort -u tmp.txt | wc -l
Result : 5622 lines in the duplicated file, 2811 lines in the unique file => all entries are duplicates, so we can delete the
duplicated items.
select member_id , count(*) from sequence_cds GROUP BY member_id HAVING COUNT(*) > 1
order by count(*) desc limit 3;
create table tmp_s_cds (sequence_cds_id int) ;
INSERT into tmp_s_cds
SELECT sequence_cds_id
FROM sequence_cds GROUP BY member_id HAVING COUNT(*) > 1;
select count(*) from tmp_s_cds ;
Check: count(*) = 2811 , half of 5622 = OK !
DELETE FROM sequence_cds
USING tmp_s_cds , sequence_cds
WHERE tmp_s_cds.sequence_cds_id = sequence_cds.sequence_cds_id;
Check:
SELECT * from tmp_s_cds , sequence_cds
WHERE tmp_s_cds.sequence_cds_id = sequence_cds.sequence_cds_id ;
DROP table tmp_seb_id;
Now, try to apply the patch again:
ALTER TABLE sequence_cds
DROP PRIMARY KEY,
DROP KEY member_id,
DROP COLUMN sequence_cds_id,
ADD PRIMARY KEY (member_id);
On Jan 23, 2014, at 4:56 PM, Jan Vogel <jan.vogel at gmail.com> wrote:
> Hello Matthieu et al,
>
> thanks for the comments - I managed to patch the database, but used a slightly different SQL.
>
> Also thanks to the compara team and ensembl to supply these patch files - we're computing our own gene sets, however we're patching all in-house databases up to the most recent schema so we can take advantage of all the bug fixes and improvements.
>
> Regards,
>
> Jan
>
>
>
> Here's my log, if someone is interested :
>
> Schema patch fails due to duplicated member_ids in the sequence_exon_bounded table:
>
> ALTER TABLE sequence_exon_bounded
> DROP PRIMARY KEY, DROP KEY member_id,
> DROP COLUMN sequence_exon_bounded_id,
> ADD PRIMARY KEY (member_id)
>
>
> There are some member_id duplicates in the table ... let's fix this - details in IGIS
>
> Identify the max. number of duplications in the member_id group:
>
> SELECT sequence_exon_bounded_id , count(*)
> FROM sequence_exon_bounded GROUP BY member_id HAVING COUNT(*) > 1
> order by count(*) desc limit 3;
>
> Create tmp table to store the sequence_exon_bounded_id identifiers for deletion:
>
> create table tmp_seb_id (sequence_exon_bounded_id int) ;
>
> INSERT into tmp_seb_id
> SELECT sequence_exon_bounded_id
> FROM sequence_exon_bounded GROUP BY member_id HAVING COUNT(*) > 1;
>
> DELETE FROM seb
> USING tmp_seb_id tmp_seb, sequence_exon_bounded seb
> WHERE tmp_seb.sequence_exon_bounded_id = seb.sequence_exon_bounded_id;
>
> Check:
>
> SELECT * from tmp_seb_id tmp_seb, sequence_exon_bounded seb
> WHERE tmp_seb.sequence_exon_bounded_id = seb.sequence_exon_bounded_id ;
>
> DROP table tmp_seb_id;
>
>
>
> On Jan 17, 2014, at 9:49 AM, Matthieu Muffato <muffato at ebi.ac.uk> wrote:
>
>> Dear Jan,
>>
>> There should be a single "exon bounded" sequence for each member_id and the API would anyway select only one of them.
>> You can remove the duplicated entries with this query:
>>
>> DELETE FROM sequence_exon_bounded WHERE sequence_exon_bounded_id IN (SELECT sequence_exon_bounded_id FROM sequence_exon_bounded GROUP BY member_id HAVING COUNT(*) > 1);
>>
>> You may have to run it several times if some member_ids have 3 or more sequences
>>
>>
>> Regarding Will's comment: Compara indeed only started distributing the patches as individual patches with the according meta keys a few releases ago.
>>
>> Best,
>> Matthieu
>>
>> On 17/01/14 10:08, Will Chow wrote:
>>> Hi Jan
>>>
>>> I noticed warnings as well patching from 66->73 for compara for a few
>>> tables ( I don't quite remember which ones ).
>>>
>>> Of course this doesn't really help you or solve your issues, but since
>>> our group doesn't use those tables I just ignored the patch warnings and
>>> for our purposes, it seems to work. I do notice in the meta table the
>>> patch keys for the updates are not there below e70 even though the
>>> patches are in the sql directory.
>>>
>>> again doesn't really help you I guess.
>>>
>>> Will
>>>
>>> On 17 Jan 2014, at 06:49, Jan Vogel <jan.vogel at gmail.com
>>> <mailto:jan.vogel at gmail.com>> wrote:
>>>
>>>>
>>>> Hi there,
>>>>
>>>> i'm trying to patch our ensembl 67 compara database up to schema 74.
>>>>
>>>> It fails :-( when applying the
>>>> *ensembl-compara/sql/patch_67_68.sql* patch as in the 67 schema there
>>>> are duplicate member ids in the sequence_exon_bounded table. Any idea
>>>> how to fix this and patch my e67 up to e68 ?
>>>>
>>>> Thanks,
>>>>
>>>> Jan Vogel
>>>>
>>>>
>>>> ERROR 1062 (23000) at line 28 in file:
>>>> '/gnet/resfiles3/resapps/apps/ensembl/ensembl-74/ensembl-compara/sql/patch_67_68.sql':
>>>> Duplicate entry '2012449' for key 'PRIMARY' :
>>>>
>>>> --------------
>>>> SET session sql_mode='TRADITIONAL'
>>>> --------------
>>>> ……
>>>> …..
>>>>
>>>>
>>>> --------------
>>>> ALTER TABLE sequence_exon_bounded
>>>> DROP PRIMARY KEY, DROP KEY member_id,
>>>> DROP COLUMN sequence_exon_bounded_id,
>>>> ADD PRIMARY KEY (member_id)
>>>> --------------
>>>>
>>>> Failed to apply patch 'patch_67_68.sql' to database 'ensembl_compara_67'!
>>>> In non-interative mode; aborting current run at schema_patcher.pl line
>>>> 643.
>>>>
>>>>
>>>> Looking into the sequence_exon_bounded table, it looks that some
>>>> member ids are duplicated:
>>>>
>>>> mysql> select count(distinct(member_id)) from sequence_exon_bounded
>>>> limit 10 ;
>>>> +----------------------------+
>>>> | count(distinct(member_id)) |
>>>> +----------------------------+
>>>> | 1371152 |
>>>> +----------------------------+
>>>> 1 row in set (10.55 sec)
>>>>
>>>> mysql> select count(member_id) from sequence_exon_bounded limit 10 ;
>>>> +------------------+
>>>> | count(member_id) |
>>>> +------------------+
>>>> | 1373719 |
>>>> +------------------+
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>
>>
>> --
>> Matthieu Muffato, Ph.D.
>> Ensembl Developer and Ensembl Compara Manager
>> European Bioinformatics Institute (EMBL-EBI)
>> European Molecular Biology Laboratory
>> Wellcome Trust Genome Campus, Hinxton
>> Cambridge, CB10 1SD, United Kingdom
>> Room A3-145
>> Phone + 44 (0) 1223 49 4631
>> Fax + 44 (0) 1223 49 4468
>>
>> _______________________________________________
>> Dev mailing list Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20140123/62d30a2b/attachment.html>
More information about the Dev
mailing list