[ensembl-dev] [BioMart Users] how to get from ensembl main database schema to ensembl mart schema

Rhoda Kinsella rhoda at ebi.ac.uk
Mon Mar 14 11:39:14 GMT 2011


Hi Andrea,
Please find below a link to the XML files used to create 6 of the 7
Ensembl marts for release 61 (the sequence mart is created using a
script, so there is no XML to send). The XML files are for the four
visible marts on the interface (Ensembl Gene, Ensembl Variation, Ensembl
Regulation and Vega mart) as well as two marts that the user does not
see on the mart interface but which are accessed via the visible
marts (ontology mart, genomic features mart). We are currently using
BioMart version 0.7. As Arek mentioned, the Ensembl marts are quite
complex, and we have a few code hacks in place as well as pre- and
post-build patches that must be run to create the databases. It is
therefore probably best to use the MartBuilder tool to see how the mart
schemas are created from the core databases, then experiment with a
simple schema and see how you get on.
Kind regards
Rhoda

The xml files can be found here:
http://www.ebi.ac.uk/~rhoda/v61_xml/



On 12 Mar 2011, at 19:45, Arek Kasprzyk wrote:

>
> Hi Andrea,
> OK, I have a better idea now of what you want. The situation is as
> follows: 0.8 rc5 has nicely automated and integrated a lot of the
> workflows needed to create a new mart from a source schema. However,
> the Ensembl core transformation is a very complex one and
> rc5 still has only rudimentary support for it. If you just want
> an idea of how the algorithm works, it is better to start with a
> simpler use case than the Ensembl mart. It will be difficult to
> recreate the exact Ensembl mart transformation from scratch for two
> reasons. First, rc5 still has only very rudimentary support for this
> particular schema, so you will not get far. Second, 0.7 fully supports
> it, but there are a large number of 'tweaks aka hacks' to the
> transformation algorithm to get certain things to work, so you will
> find it difficult to recreate a lot of them.
> I would advise you to play with any schema to get a few datasets to
> work (the Ensembl core schema is fine to play with too). If you want to
> build new marts and integrate them with Ensembl, definitely go for
> rc5 and treat the existing Ensembl mart as a black box; the
> software will provide the means to integrate it with your newly
> created mart through a backwards-compatibility mechanism. This
> is much easier in 0.8 rc5. If, however, you just want to see exactly
> how the Ensembl mart transformation is achieved, you will need a 0.7
> XML transformation file from the Ensembl team.
>
> FYI, a short description of the basic transformation algorithm:
>
> Starting from one or more input “candidate” tables, the software
> finds the largest set of table joins it can perform using only 1:1
> and many-to-one (M:1) relations, and merges these tables together to
> create the main table. Multiple candidate tables can be given as
> input, in which case the algorithm creates a main table out of each
> selected candidate table and, if unable to do so, creates several
> separate datasets. Once the main tables are completed, if there is a
> 1:M relation between them they become main and sub-main tables. If
> there is no 1:M relation between them, they are split into separate
> datasets. Any tables that have a 1:M or many-to-many (M:N) relation
> with the newly created main table or sub-main table are made into
> independent dimension tables.
>
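To make the description above concrete, here is a minimal Python sketch of the merge-and-split step for a single candidate table. It is not MartBuilder's code: the function names, the relation list and the cardinalities are all assumptions made up for illustration (the table names are borrowed from the Ensembl core schema only for flavour), and the main/sub-main step is left out.

```python
from collections import deque

# Toy relations, read left to right: ("a", "b", "M:1") means many rows of
# "a" point at one row of "b". These cardinalities are illustrative
# assumptions, not the real Ensembl core schema.
RELATIONS = [
    ("translation", "transcript", "1:1"),
    ("transcript", "gene", "M:1"),
    ("translation", "protein_feature", "1:M"),
    ("gene", "xref", "M:N"),
]

# Cardinality as seen from the other end of the same relation.
FLIP = {"1:1": "1:1", "M:1": "1:M", "1:M": "M:1", "M:N": "M:N"}


def neighbours(table, relations):
    """Yield (other_table, cardinality as seen from `table`)."""
    for left, right, card in relations:
        if left == table:
            yield right, card
        elif right == table:
            yield left, FLIP[card]


def build_dataset(candidate, relations):
    """Return (tables merged into the main table, dimension tables)."""
    merged = [candidate]
    queue = deque([candidate])
    # Step 1: follow only 1:1 and M:1 relations, merging as we go.
    while queue:
        current = queue.popleft()
        for other, card in neighbours(current, relations):
            if card in ("1:1", "M:1") and other not in merged:
                merged.append(other)
                queue.append(other)
    # Step 2: anything hanging off the merged set via 1:M or M:N
    # becomes an independent dimension table.
    dimensions = []
    for table in merged:
        for other, card in neighbours(table, relations):
            if (card in ("1:M", "M:N")
                    and other not in merged
                    and other not in dimensions):
                dimensions.append(other)
    return merged, dimensions


main, dims = build_dataset("translation", RELATIONS)
print("merged into main table:", main)
print("dimension tables:", dims)
```

With these toy relations, translation, transcript and gene are merged into one main table, while protein_feature and xref end up as dimension tables.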
>
> Please let us know if you have any more questions,
> a
> Arek Kasprzyk
> Director, Bioinformatics Operations and Principal Investigator
>
> Ontario Institute for Cancer Research
> MaRS Centre, South Tower
> 101 College Street, Suite 800
> Toronto, Ontario, Canada M5G 0A3
>
> Tel:       416-673-8559
> Toll-free:           1-866-678-6427
> www.oicr.on.ca
>
>
>
> Administrative Assistant: Natasha.Lander at oicr.on.ca
>
> From: Andrea Edwards <edwardsa at cs.man.ac.uk>
> Date: Sat, 12 Mar 2011 14:13:24 -0500
> To: "users at biomart.org" <users at biomart.org>
> Subject: Re: [BioMart Users] how to get from ensembl main database  
> schema to ensembl mart schema
>
> Hi
>
> I am not concerned whether I use BioMart 0.7 or 0.8 - whichever is
> easiest for what I would like to do. I haven't done anything yet and
> I'm starting from scratch.
>
> All I want to do is have a go at re-creating the Ensembl mart from the
> Ensembl core databases. I wanted to do this because Ensembl is an
> example of a database whose schema I am familiar with and whose mart I
> have used. I wanted to do this for 3 reasons:
> a) to get some practice
> b) to get an intuition of what type of mart I can create from my own
> database schema, what types of query I can run and what the
> filters/attributes will be
> c) to get an idea of how I could integrate my database with Ensembl, as
> I believe they only need to share IDs or an underlying assembly to be
> integrated
>
> Will I be able to recreate the Ensembl mart in BioMart 0.8? I presume
> the Ensembl XML files are available for 0.7 and I won't be able to
> read them in 0.8? Without these files, how will I know the exact steps
> Ensembl used to specify their mart structure? How will I know what main
> tables they chose or how, for example, they created the PRINTS
> dimension table mentioned in my original query?
>
> Thanks a lot
>
>
> On 12/03/2011 18:59, Arek Kasprzyk wrote:
>> Putting this back on the list to keep everyone else in the loop
>>
>> a
>>
>>
>>
>> On 2011-03-12, at 13:56, "Arek Kasprzyk"<Arek.Kasprzyk at oicr.on.ca>
>> wrote:
>>
>>> If you are starting from scratch it would be much better to start
>>> with 0.8 rc5. Creating a new mart is as simple as choosing one or
>>> more main tables in the source schema. You can choose different
>>> tables and create different datasets. There is some documentation
>>> about it in rc5. If you want to know how the transformation
>>> algorithm works, I can describe that to you too.
>>>
>>>
>>> a
>>>
>>>
>>>
>>> On 2011-03-12, at 12:53, "Andrea Edwards"<edwardsa at cs.man.ac.uk>
>>> wrote:
>>>
>>>> ok - thanks
>>>>
>>>> I don't know much about BioMart, as you can probably tell, but I was
>>>> told there are quite significant differences between 0.7 and 0.8.
>>>> If I am interested in understanding how the schema transformations
>>>> take place so that I can design my own mart and integrate it with
>>>> existing marts, would I be better off dropping back to 0.7? I'm keen
>>>> to get a mart up and running very soon.
>>>>
>>>> On 12/03/2011 17:41, Arek Kasprzyk wrote:
>>>>> 0.8 rc5 still has only rudimentary support for the MBuilder
>>>>> component. You will not be able to read 0.7 MBuilder XML with it.
>>>>> (CCing Junjun, who has just taken over the coordination of
>>>>> BioMart development, to let him know that such discussions are
>>>>> taking place.)
>>>>>
>>>>> a
>>>>>
>>>>>
>>>>>
>>>>> On 2011-03-12, at 12:28, "Andrea Edwards"<edwardsa at cs.man.ac.uk>
>>>>> wrote:
>>>>>
>>>>>> Brilliant - thanks for such a prompt reply.
>>>>>>
>>>>>> I note that you say MBuilder (0.7), whereas I have checked out the
>>>>>> code for BioMart 0.8 rc4.
>>>>>>
>>>>>>
>>>>>> On 12/03/2011 16:39, Arek Kasprzyk wrote:
>>>>>>> Hi Andrea
>>>>>>> All the transformation information is stored in the XML file that
>>>>>>> MBuilder (0.7) uses to compile its DDL for Ensembl core
>>>>>>> databases. I am sure the Ensembl mart team will be happy to
>>>>>>> provide you with the latest version.
>>>>>>>
>>>>>>> a
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2011-03-12, at 11:15, "Andrea Edwards"<edwardsa at cs.man.ac.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello
>>>>>>>>
>>>>>>>> I was wondering if there were any documents showing how the
>>>>>>>> Ensembl marts were created from the main Ensembl databases.
>>>>>>>> Specifically, I was hoping there were documents describing which
>>>>>>>> tables were selected as main tables for the marts and how the
>>>>>>>> dimension tables were mapped to the main tables.
>>>>>>>>
>>>>>>>> As an example, ensembl_mart_61 contains a main table for human
>>>>>>>> named translation_main (this is an abbreviation of the name but
>>>>>>>> it's obvious which one I mean), and this has a field called
>>>>>>>> protein_feature_prints_bool, which is essentially a boolean field
>>>>>>>> indicating whether a protein translation is associated with a row
>>>>>>>> in the PRINTS dimension table protein_feature_prints_dm. If the
>>>>>>>> translation does have a row in this dimension table then I am
>>>>>>>> guessing it has a PRINTS domain in it!
>>>>>>>>
>>>>>>>> The core database itself, however, has a table called translation
>>>>>>>> which represents, well, a translation. Translations are linked to
>>>>>>>> rows in a table called 'protein_feature', which in turn has a
>>>>>>>> foreign key called analysis_id which links to an 'analysis' table
>>>>>>>> with fields 'database' and 'program'. So in this schema, a
>>>>>>>> translation is associated with a PRINTS annotation if it is linked
>>>>>>>> to a 'protein_feature' record which is in turn linked to an
>>>>>>>> 'analysis' record with the text 'PRINTS' somewhere in the database
>>>>>>>> and/or program fields.
>>>>>>>>
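The mapping described in the two paragraphs above can be sketched with Python's built-in sqlite3 module. This is only one way of expressing the reading given in the message, not the actual Ensembl mart build: the tables are cut down to a handful of columns, the 'database' field is called db here purely to steer clear of SQL keywords, hit_name is an illustrative column, and all row values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy versions of the core tables described in the message.
    CREATE TABLE translation (translation_id INTEGER PRIMARY KEY);
    CREATE TABLE analysis (
        analysis_id INTEGER PRIMARY KEY,
        db          TEXT,   -- stands in for the 'database' field
        program     TEXT
    );
    CREATE TABLE protein_feature (
        protein_feature_id INTEGER PRIMARY KEY,
        translation_id     INTEGER REFERENCES translation(translation_id),
        analysis_id        INTEGER REFERENCES analysis(analysis_id),
        hit_name           TEXT
    );

    -- Made-up rows: translation 1 has a PRINTS hit, translation 2 does not.
    INSERT INTO translation VALUES (1), (2);
    INSERT INTO analysis VALUES (10, 'PRINTS', 'tool_a'), (11, 'OtherDB', 'tool_b');
    INSERT INTO protein_feature VALUES (100, 1, 10, 'hit_x'), (101, 2, 11, 'hit_y');

    -- Dimension table: one row per PRINTS feature, keyed by translation.
    CREATE TABLE protein_feature_prints_dm AS
    SELECT pf.translation_id, pf.hit_name
    FROM protein_feature pf
    JOIN analysis a ON a.analysis_id = pf.analysis_id
    WHERE a.db LIKE '%PRINTS%' OR a.program LIKE '%PRINTS%';

    -- Main table: one row per translation plus a flag saying whether the
    -- dimension table holds at least one row for it.
    CREATE TABLE translation_main AS
    SELECT t.translation_id,
           EXISTS (SELECT 1 FROM protein_feature_prints_dm d
                   WHERE d.translation_id = t.translation_id)
               AS protein_feature_prints_bool
    FROM translation t;
""")

rows = conn.execute(
    "SELECT translation_id, protein_feature_prints_bool "
    "FROM translation_main ORDER BY translation_id"
).fetchall()
print(rows)  # [(1, 1), (2, 0)]
```

The flag column lets a query filter on "has a PRINTS hit" without touching the dimension table, which is the role the message guesses for protein_feature_prints_bool.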
>>>>>>>> I am interested in how the BioMart software is configured with
>>>>>>>> 'rules' to create the mart schema from the database schema. Is
>>>>>>>> there a configuration file containing these rules that I could
>>>>>>>> look at? Is there a worked example? As an academic exercise I'd
>>>>>>>> like to recreate the Ensembl marts. I have the BioMart user
>>>>>>>> manual, but even with that document I do not know how to recreate
>>>>>>>> the Ensembl marts.
>>>>>>>>
>>>>>>>> I am NOT specifically interested in protein domains. I used the
>>>>>>>> PRINTS example purely for illustrative purposes as I thought it
>>>>>>>> was a straightforward example. I am interested in how you specify
>>>>>>>> the 'rules' to get from a schema to a mart.
>>>>>>>>
>>>>>>>> thanks a lot
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users at biomart.org
>>>>>>>> https://lists.biomart.org/mailman/listinfo/users

Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.
