[ensembl-dev] taxonomy database

Thu Aug 28 14:46:18 BST 2014

Hi Lel,

On 28/08/14 13:25, Lel Eory wrote:
> Hello Developers,
>
> Is it correct that the ncbi_taxonomy database is generated via the
> Bio::EnsEMBL::Compara::PipeConfig::ImportNCBItaxonomy_conf pipeline?

Yes, it is the correct config file.

> Just to get the ncbi_taxonomy database would I need to add anything to
> the ensembl-compara/scripts/taxonomy/ensembl_aliases.sql file (i.e. for
> running the ImportNCBItaxonomy_conf pipeline)?

You'll get a working taxonomy database even without editing the
ensembl_aliases.sql file (see below for details about this file).

> Also, can I update the database, or do I need to drop the "old" database
> and re-run the pipeline when I want to get the most recent taxonomy
> information?

The pipeline will generally create a new database with the eHive schema
and the ncbi tables. Usually, we run it that way and, once everything is
done and checked, we copy only the two ncbi tables over to our usual
database to replace the old copy.
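
As a rough illustration of that last copy step, here is a minimal
sketch that only builds the shell pipeline as a string. It assumes the
two tables are named ncbi_taxa_node and ncbi_taxa_name, and the host,
database and user names are made-up placeholders, not our actual
setup. Authentication is left out.

```python
import shlex

# Assumption: the two NCBI tables carry these names in the schema.
NCBI_TABLES = ["ncbi_taxa_node", "ncbi_taxa_name"]


def copy_tables_command(src_host, src_db, dst_host, dst_db, user,
                        tables=NCBI_TABLES):
    """Build a 'mysqldump | mysql' pipeline that dumps `tables` from the
    freshly built database and loads them into the usual one."""
    dump = ["mysqldump", "-h", src_host, "-u", user, src_db, *tables]
    load = ["mysql", "-h", dst_host, "-u", user, dst_db]
    # shlex.join quotes each argument so the result is safe to paste
    # into a shell.
    return " | ".join(shlex.join(cmd) for cmd in (dump, load))


if __name__ == "__main__":
    # Hypothetical host / database / user names, for illustration only.
    print(copy_tables_command("pipeline-host", "new_ncbi_taxonomy",
                              "main-host", "ncbi_taxonomy", "admin"))
```

With its default options mysqldump writes a DROP TABLE before each
CREATE TABLE, so loading the dump replaces the old copy of those two
tables and leaves the rest of the target database alone.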

> For the compara master database I need to define the 'ensembl alias
> name' and 'ensembl timetree mya' fields in the ensembl_aliases.sql file.
> Should this file only contain this information on species (and
> ancestrals) which are relevant for the compara pipelines? Will only the
> relevant information be loaded?

The "ensembl timetree mya" fields are only needed if you want to run our 
CAFE pipeline to detect gene family expansions / contractions. The 
information is also shown on the website, in the zmenus of the gene-tree 
view, but I think the web-code will simply skip the field if it's missing.

The "ensembl alias name" fields are only used on the website, in the
gene-tree view. The view will use the scientific names if the alias is
missing.

If you want both of those, you have to define the alias for each extant 
species, and both the alias and the divergence time for each internal 
node (ancestral taxon), but you can also skip this step if you don't
need the CAFE data or the web polish.
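
If it helps, the rule above can be sketched as a small check. This is
only an illustration working on an in-memory dictionary, not something
from our code base: the data layout, the function name and the alias
strings in the example are all assumptions.

```python
ALIAS = "ensembl alias name"
TIMETREE = "ensembl timetree mya"


def missing_annotations(nodes):
    """Return (taxon_id, name_class) pairs that still need defining.

    `nodes` maps taxon_id -> {"is_leaf": bool, "names": {name_class: value}}.
    Extant species (leaves) need the alias only; internal nodes need
    both the alias and the divergence time.
    """
    missing = []
    for taxon_id, node in sorted(nodes.items()):
        required = [ALIAS] if node["is_leaf"] else [ALIAS, TIMETREE]
        for name_class in required:
            if name_class not in node["names"]:
                missing.append((taxon_id, name_class))
    return missing


if __name__ == "__main__":
    # Illustrative entries; the alias strings are placeholders.
    example = {
        9606: {"is_leaf": True, "names": {ALIAS: "Human"}},
        9598: {"is_leaf": True, "names": {}},
        207598: {"is_leaf": False, "names": {ALIAS: "Hominines"}},
    }
    for taxon_id, name_class in missing_annotations(example):
        print(f"taxon {taxon_id}: no '{name_class}'")
```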

> Is there a script I can use to get the 'ensembl timetree mya' values, or
> is it done manually from www.timetree.org?

Hum, kind-of. Because we only add a handful of species every release, we 
don't have a script to do a bulk import from timetree (their terms and 
conditions forbid "large-scale data-mining" anyway), but we have a 
script that will take one taxon_id and report where it should be 
inserted in the species tree (the taxon_id of the internal node) and the 
divergence time (by querying and grepping timetree). This is not on the 
default branch of the ensembl-compara repo, but you can still find it on 
github: 
https://github.com/Ensembl/ensembl-compara/blob/feature/hmm_classification/modules/Bio/EnsEMBL/Compara/Utils/SpeciesTree.pm#L191
You could run this script multiple times, once for each taxon_id you
need to place.

Hope this helps,
Matthieu