[ensembl-dev] database model and API versions

Matthieu Muffato muffato at ebi.ac.uk
Tue Apr 28 10:50:45 BST 2015


The assemblies and gene sets haven't changed. We haven't recomputed any 
whole-genome alignments / gene trees on the GRCh37 site.

There is one exception to this: the Gene gain/loss analysis was missing 
some data in the initial release 75 of Ensembl (the last one on the 
GRCh37 assembly). It has since been fixed on both the main site and the 
GRCh37 site.

Matthieu

On 28/04/15 10:44, Will Chow wrote:
> Just a curiosity question, on 3337, what is being updated?  Is it just
> schema changes?  I guess with the static organism databases used for
> compara, compara itself doesn’t require any update, unless there are
> updates to the human gene build affecting maybe gene trees?
>
> thanks.
>
> Will
>
>
> On Apr 28, 2015, at 10:25 AM, mag <mr6 at ebi.ac.uk> wrote:
>
>> Hi Duarte,
>>
>> The VEP --assembly flag uses the solution I suggested initially, which
>> is to have the two databases on two separate servers.
>> By specifying --assembly GRCh37, the default 3306 port is replaced by
>> the 3337 port, which is where the GRCh37 databases are hosted.
>> It is worth noting as well that only the human databases on port 3337
>> are updated; all the other databases are identical to the ones from
>> release 75.
>>
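The port switch described above can be sketched for a registry-based script as follows. This is only an illustration, not VEP's actual code: the helper names `registry_args_for` and `load_registry_for` are hypothetical, and the `anonymous` user is assumed to be the public read-only account on ensembldb.ensembl.org. The port numbers are the ones quoted in this thread.

```perl
use strict;
use warnings;

# Ports per human assembly, as quoted in this thread for the public server.
my %PORT_FOR_ASSEMBLY = (
    GRCh38 => 3306,    # default port
    GRCh37 => 3337,    # dedicated GRCh37 databases
);

# Hypothetical helper (not part of the Ensembl API): returns the connection
# arguments for load_registry_from_db for the requested assembly.
sub registry_args_for {
    my ($assembly) = @_;
    my $port = $PORT_FOR_ASSEMBLY{$assembly}
        or die "Unsupported assembly: $assembly\n";
    return (
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',    # assumption: public read-only account
        -port => $port,
    );
}

# Hypothetical wrapper: loads the registry from the server matching the
# assembly. The Ensembl module is required lazily, so registry_args_for
# can be used (and checked) without the API installed.
sub load_registry_for {
    my ($assembly) = @_;
    require Bio::EnsEMBL::Registry;
    Bio::EnsEMBL::Registry->load_registry_from_db( registry_args_for($assembly) );
    return 'Bio::EnsEMBL::Registry';
}
```

With the Ensembl API installed, a script could then call `my $registry = load_registry_for('GRCh37');` followed by `$registry->get_adaptor('human', 'core', 'slice')`, leaving the rest of the script unchanged between assemblies.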
>> The current implementation of the registry does not support two core
>> databases for a single species on the same server.
>>
>> The solutions are:
>> - use two separate servers
>>   In the case of our live servers, we have ensembldb.ensembl.org:3306
>>   for GRCh38 and ensembldb.ensembl.org:3337 for GRCh37
>> - bypass the registry and specify each required database individually
>>   This will only work if connecting to one database at a time
>>
>> Our system is currently in transition between two models.
>> Historically, one species has one assembly at one given time.
>> With the migration from GRCh37 to GRCh38 and the future of genomics,
>> we see the need to support multiple assemblies for a single species.
>> We are currently working on better solutions for this.
>>
>>
>> Regards,
>> Magali
>>
>> On 28/04/2015 09:51, Duarte Molha wrote:
>>> Ok ... thanks Magali.
>>>
>>> I believe the latest VEP now supports an --assembly flag to allow it
>>> to annotate against a specific assembly.
>>> Can we not have the same flag on the registry ?
>>> How does VEP do it?
>>> This would be incredibly useful because I would not have to create
>>> new scripts to support a different assembly.
>>>
>>> I could just download all the tables and just tell the registry which
>>> one to use.
>>>
>>> Please correct me if I am wrong but your proposed solution would mean
>>> I would have to bypass the registry completely and I would need to
>>> create each adaptor from scratch and thus I would need to alter
>>> a lot of my scripts to support both assemblies.
>>>
>>> "--assembly GRCh37" would be a much more preferable route.
>>>
>>> Best regards,
>>>
>>>     Duarte
>>>
>>>
>>>
>>>
>>> =========================
>>>      Duarte Miguel Paulo Molha
>>> http://about.me/duarte
>>> =========================
>>>
>>> On 27 April 2015 at 17:48, mag <mr6 at ebi.ac.uk> wrote:
>>>
>>>     Hi Duarte,
>>>
>>>     The MySQL dumps for GRCh37 are available on the FTP site as well:
>>>     ftp://ftp.ensembl.org/pub/grch37/release-79/mysql/
>>>
>>>     I would recommend having only one copy of human for release 79.
>>>     So if you are interested in the GRCh37 data, you can download the
>>>     database from ftp://ftp.ensembl.org/pub/grch37/release-79/mysql/
>>>     rather than ftp://ftp.ensembl.org/pub/release-79/mysql/
>>>
>>>     If you need both databases on the same server, you can access a
>>>     given database directly rather than using the registry.
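>>>     use Bio::EnsEMBL::DBSQL::DBAdaptor;
>>>
>>>     # Connect directly to the GRCh37 core database, bypassing the registry
>>>     my $human_dba = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
>>>         -HOST    => 'localhost',
>>>         -PORT    => 3306,
>>>         -USER    => 'user',
>>>         -DBNAME  => 'homo_sapiens_core_79_37',
>>>         -SPECIES => 'homo_sapiens',
>>>         -GROUP   => 'core',
>>>     );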
>>>
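To keep a single script usable against either assembly with the direct connection above, the database name could be derived from the assembly name. The following is a minimal sketch under stated assumptions: `human_core_args` and `human_core_dba` are hypothetical helpers, not part of the Ensembl API, and the `_37` / `_38` suffixes follow the database names mentioned in this thread (79_37 and 79_38). The host, port, and user are the placeholder values from the example above. It still connects to only one database at a time.

```perl
use strict;
use warnings;

# Assembly name => suffix of the human core database name, following the
# names used in this thread (homo_sapiens_core_79_37, and 79_38 by default).
my %DB_SUFFIX_FOR_ASSEMBLY = ( GRCh37 => 37, GRCh38 => 38 );

# Hypothetical helper: builds the DBAdaptor arguments for the human core
# database of a given release and assembly on a local server.
sub human_core_args {
    my ( $release, $assembly ) = @_;
    my $suffix = $DB_SUFFIX_FOR_ASSEMBLY{$assembly}
        or die "Unsupported assembly: $assembly\n";
    return (
        -HOST    => 'localhost',    # placeholder values, as in the example above
        -PORT    => 3306,
        -USER    => 'user',
        -DBNAME  => "homo_sapiens_core_${release}_${suffix}",
        -SPECIES => 'homo_sapiens',
        -GROUP   => 'core',
    );
}

# Hypothetical wrapper: connects to one human core database without going
# through the registry. The Ensembl module is required lazily.
sub human_core_dba {
    my ( $release, $assembly ) = @_;
    require Bio::EnsEMBL::DBSQL::DBAdaptor;
    return Bio::EnsEMBL::DBSQL::DBAdaptor->new(
        human_core_args( $release, $assembly ) );
}
```

A script would then call `my $dba = human_core_dba(79, 'GRCh37');` and obtain adaptors from it, for example `$dba->get_SliceAdaptor`, so that only the assembly argument changes between runs.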
>>>
>>>     Hope that helps,
>>>     Magali
>>>
>>>     On 27/04/2015 17:11, Duarte Molha wrote:
>>>>     Thanks Magali
>>>>
>>>>     But I think you have not understood my question.
>>>>
>>>>     Assume I want to download the databases to my local computer and
>>>>     use the perl API 79 to query the latest 79_37 database instead
>>>>     of the default 79_38.
>>>>     Previously, I just had to download the MySQL tables
>>>>     corresponding to the API I was using to fetch the data
>>>>     correctly. However, you have now broken that link:
>>>>     API->Underlying_assembly
>>>>
>>>>     So how do I tell my scripts what database to query?
>>>>
>>>>     My local MySQL databases will all be on the same port.
>>>>
>>>>     Best regards
>>>>
>>>>     Duarte
>>>>
>>>>
>>>>
>>>>     =========================
>>>>          Duarte Miguel Paulo Molha
>>>>     http://about.me/duarte
>>>>     =========================
>>>>
>>>>     On 27 April 2015 at 17:01, mag <mr6 at ebi.ac.uk> wrote:
>>>>
>>>>         Hi Duarte,
>>>>
>>>>         The archive 75 website is still based on the release 75 API.
>>>>
>>>>         For the dedicated GRCh37 website though, we have used a data
>>>>         freeze from release 75 and have since been updating the
>>>>         website and underlying databases along with the main release.
>>>>         The GRCh37 databases are available on our main MySQL server
>>>>         on port 3337 (instead of the default 3306 which will give
>>>>         you access to GRCh38 databases)
>>>>
>>>>
>>>>         Hope that helps,
>>>>         Magali
>>>>
>>>>
>>>>         On 27/04/2015 16:56, Duarte Molha wrote:
>>>>>         Dear developers
>>>>>
>>>>>         On your GRCh37 archive site you say this:
>>>>>
>>>>>         ===========================
>>>>>
>>>>>
>>>>>             About this archive
>>>>>
>>>>>         This archive is based on Ensembl Release 75 data, and gives
>>>>>         continuing access to human assembly GRCh37, as well as all
>>>>>         our other release 75 species (data freeze March 2014) for
>>>>>         comparative purposes. Human variation and regulation data
>>>>>         has since been updated in March 2015.
>>>>>
>>>>>         The API and website will be updated in tandem with the
>>>>>         release of the main Ensembl website (*currently version
>>>>>         79*), and we will also periodically update this site with
>>>>>         new human data, which will be announced in this panel.
>>>>>
>>>>>         MySQL dumps of human databases on the most recent schema
>>>>>         version are available on our FTP site
>>>>>         <ftp://ftp.ensembl.org/pub/grch37/>.
>>>>>
>>>>>         =========================
>>>>>
>>>>>         It was my understanding that an API version was directly
>>>>>         linked to a specific assembly. So I thought that if I
>>>>>         wanted to query the latest GRCh37 assembly I would need to
>>>>>         use the api v75 and if I wanted to use a local database, I
>>>>>         would download the corresponding sql tables for that version.
>>>>>
>>>>>         However, according to this announcement, I can now use the
>>>>>         V79 api and query the old assembly... How is this
>>>>>         accomplished ?
>>>>>         What do I have to do in my scripts to make sure I am
>>>>>         querying the 37 version even though I am using the latest API?
>>>>>
>>>>>         Sorry, I hope this is not a stupid question but I am a bit
>>>>>         confused.
>>>>>
>>>>>         Best regards
>>>>>
>>>>>         Duarte
>>>>>

-- 
Matthieu Muffato, Ph.D.
Ensembl Compara and TreeFam Project Leader
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton
Cambridge, CB10 1SD, United Kingdom
Room  A3-145
Phone + 44 (0) 1223 49 4631
Fax   + 44 (0) 1223 49 4468



