[ensembl-dev] taxonomy database

Thu Aug 28 15:24:00 BST 2014

Great! Thank you Matthieu,
Lel

On 08/28/2014 02:46 PM, Matthieu Muffato wrote:
> Hi Lel,
>
> On 28/08/14 13:25, Lel Eory wrote:
>> Hello Developers,
>>
>> Is it correct that the ncbi_taxonomy database is generated via the
>> Bio::EnsEMBL::Compara::PipeConfig::ImportNCBItaxonomy_conf pipeline?
>
> Yes, it is the correct config file.
>
>> Just to get the ncbi_taxonomy database would I need to add anything to
>> the ensembl-compara/scripts/taxonomy/ensembl_aliases.sql file (i.e. for
>> running the ImportNCBItaxonomy_conf pipeline)?
>
> You'll get a working taxonomy database even without editing the
> ensembl_aliases.sql file (see below for details about this file).
>
>> Also, can I update the database, or do I need to drop the "old" database
>> and re-run the pipeline when I want to get the most recent taxonomy
>> information?
>
> The pipeline will generally create a new database with the eHive
> schema and the ncbi tables. Usually, we run it that way and, once
> everything is done and checked, we copy over only the two ncbi tables
> to our usual database to replace the old copy.
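
A minimal sketch of the copy step described above, assuming the two ncbi tables are named ncbi_taxa_node and ncbi_taxa_name and that both databases sit on the same MySQL server. The database names, host, and user are hypothetical placeholders. The function only builds the shell command; it does not run anything.

```python
import shlex

# Assumed names of "the two ncbi tables"; check them against your schema.
NCBI_TABLES = ("ncbi_taxa_node", "ncbi_taxa_name")


def build_copy_command(src_db, dest_db, host="localhost", user="ensadmin"):
    """Return a shell pipeline that dumps the ncbi tables from src_db
    and loads them into dest_db.

    By default mysqldump writes DROP TABLE IF EXISTS before each
    CREATE TABLE, so loading the dump replaces the old copies.
    Credentials are expected to come from ~/.my.cnf (an assumption).
    """
    dump = ["mysqldump", f"--host={host}", f"--user={user}", src_db, *NCBI_TABLES]
    load = ["mysql", f"--host={host}", f"--user={user}", dest_db]
    quote = lambda args: " ".join(shlex.quote(a) for a in args)
    return quote(dump) + " | " + quote(load)


if __name__ == "__main__":
    # Hypothetical database names, for illustration only.
    print(build_copy_command("lel_ncbi_taxonomy_new", "lel_compara_master"))
```

Printing the command first makes it easy to review before pasting it into a shell, which fits the "once everything is done and checked" workflow.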
>
>> For the compara master database I need to define the 'ensembl alias
>> name' and 'ensembl timetree mya' fields in the ensembl_aliases.sql file.
>> Should this file only contain this information on species (and
>> ancestrals) which are relevant for the compara pipelines? Will only the
>> relevant information be loaded?
>
> The "ensembl timetree mya" fields are only needed if you want to run 
> our CAFE pipeline to detect gene family expansions / contractions. The 
> information is also shown on the website, in the zmenus of the 
> gene-tree view, but I think the web-code will simply skip the field if 
> it's missing.
> The "ensembl alias name" fields are only used on the website, in the
> gene-tree view. The view will use the scientific names if the alias is
> missing.
>
> If you want both of those, you have to define the alias for each 
> extant species, and both the alias and the divergence time for each 
> internal node (ancestral taxon), but you can also skip this step if 
> you don't need CAFE data / web polishings.
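
A rough sketch of how such entries could be generated, assuming the ensembl_aliases.sql file holds plain INSERT statements into an ncbi_taxa_name table with columns taxon_id, name, and name_class (an assumption; check the existing file for its exact form). The helper names are hypothetical and the divergence time is a placeholder, not a real TimeTree estimate.

```python
def sql_quote(value):
    """Quote a value as a SQL string literal, doubling single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def alias_statements(taxon_id, alias, mya=None):
    """Build INSERT statements for one taxon.

    Extant species only need the alias; internal nodes (ancestral taxa)
    also need a divergence time, passed as mya.
    """
    rows = [(alias, "ensembl alias name")]
    if mya is not None:
        rows.append((mya, "ensembl timetree mya"))
    return [
        "INSERT INTO ncbi_taxa_name (taxon_id, name, name_class) "
        f"VALUES ({int(taxon_id)}, {sql_quote(name)}, {sql_quote(name_class)});"
        for name, name_class in rows
    ]


if __name__ == "__main__":
    # 9606 is the NCBI taxon_id of Homo sapiens (extant: alias only).
    for line in alias_statements(9606, "Human"):
        print(line)
    # 9443 is Primates (internal node); 12.3 is a placeholder value.
    for line in alias_statements(9443, "Primates", mya=12.3):
        print(line)
```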
>
>> Is there a script I can use to get the 'ensembl timetree mya' values, or
>> is it done manually from www.timetree.org?
>
> Hmm, kind of. Because we only add a handful of species every release,
> we don't have a script to do a bulk import from timetree (their terms 
> and conditions forbid "large-scale data-mining" anyway), but we have a 
> script that will take one taxon_id and report where it should be 
> inserted in the species tree (the taxon_id of the internal node) and 
> the divergence time (by querying and grepping timetree). This is not 
> on the default branch of the ensembl-compara repo, but you can still 
> find it on github: 
> https://github.com/Ensembl/ensembl-compara/blob/feature/hmm_classification/modules/Bio/EnsEMBL/Compara/Utils/SpeciesTree.pm#L191 
>
> You could run this script multiple times, once per taxon_id.
>
> Hope this helps,
> Matthieu
