[ensembl-dev] taxonomy database
Lel Eory
lel.eory at roslin.ed.ac.uk
Thu Aug 28 15:24:00 BST 2014
Great! Thank you Matthieu,
Lel
On 08/28/2014 02:46 PM, Matthieu Muffato wrote:
> Hi Lel,
>
> On 28/08/14 13:25, Lel Eory wrote:
>> Hello Developers,
>>
>> Is it correct that the ncbi_taxonomy database is generated via the
>> Bio::EnsEMBL::Compara::PipeConfig::ImportNCBItaxonomy_conf pipeline?
>
> Yes, it is the correct config file.
>
>> Just to get the ncbi_taxonomy database, would I need to add anything to
>> the ensembl-compara/scripts/taxonomy/ensembl_aliases.sql file (i.e. for
>> running the ImportNCBItaxonomy_conf pipeline)?
>
> You'll get a working taxonomy database even without editing the
> ensembl_aliases.sql file (see below for details about this file).
>
>> Also, can I update the database, or do I need to drop the "old" database
>> and re-run the pipeline when I want to get the most recent taxonomy
>> information?
>
> The pipeline will generally create a new database with the eHive
> schema and the ncbi tables. Usually, we run it that way and, once
> everything is done and checked, we copy only the two ncbi tables
> over to our usual database to replace the old copy.
>
>> For the compara master database I need to define the 'ensembl alias
>> name' and 'ensembl timetree mya' fields in the ensembl_aliases.sql file.
>> Should this file only contain this information for species (and
>> ancestrals) which are relevant for the compara pipelines? Will only the
>> relevant information be loaded?
>
> The "ensembl timetree mya" fields are only needed if you want to run
> our CAFE pipeline to detect gene family expansions / contractions. The
> information is also shown on the website, in the zmenus of the
> gene-tree view, but I think the web-code will simply skip the field if
> it's missing.
> The "ensembl alias name" fields are only used on the website, in the
> gene-tree view. The view will use the scientific names if the alias is
> missing.
>
> If you want both of those, you have to define the alias for each
> extant species, and both the alias and the divergence time for each
> internal node (ancestral taxon), but you can also skip this step if
> you don't need CAFE data / web polishings.
>
>> Is there a script I can use to get the 'ensembl timetree mya' values, or
>> is it done manually from www.timetree.org?
>
> Hmm, kind of. Because we only add a handful of species every release,
> we don't have a script to do a bulk import from timetree (their terms
> and conditions forbid "large-scale data-mining" anyway), but we have a
> script that will take one taxon_id and report where it should be
> inserted in the species tree (the taxon_id of the internal node) and
> the divergence time (by querying and grepping timetree). This is not
> on the default branch of the ensembl-compara repo, but you can still
> find it on GitHub:
> https://github.com/Ensembl/ensembl-compara/blob/feature/hmm_classification/modules/Bio/EnsEMBL/Compara/Utils/SpeciesTree.pm#L191
>
> You could run this script multiple times, once per taxon_id.
>
> Hope this helps,
> Matthieu
>
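The table-copy step described above (build a fresh pipeline database, then copy only the two ncbi tables into the usual database) could be sketched roughly as below. This is only an illustration under assumptions, not the procedure the Compara team actually uses: the table names `ncbi_taxa_node` and `ncbi_taxa_name`, the database names, and the helper `build_copy_command` are all assumed here, and MySQL credentials are assumed to come from a client config file such as `~/.my.cnf`.

```python
import shlex

# Assumed names of "the two ncbi tables"; the thread does not spell them out.
NCBI_TABLES = ("ncbi_taxa_node", "ncbi_taxa_name")


def build_copy_command(src_db, dst_db, host="localhost", tables=NCBI_TABLES):
    """Build a shell pipeline that dumps `tables` from src_db into dst_db.

    The dump includes DROP TABLE statements (--add-drop-table), so loading
    it replaces any old copy of the same tables in dst_db.
    """
    dump = ["mysqldump", f"--host={host}", "--add-drop-table", src_db, *tables]
    load = ["mysql", f"--host={host}", dst_db]
    # shlex.join quotes each argument so the result is safe to paste in a shell.
    return f"{shlex.join(dump)} | {shlex.join(load)}"


if __name__ == "__main__":
    # Hypothetical database names, for illustration only.
    print(build_copy_command("taxonomy_new", "compara_master"))
```

The function only builds the command string, so it can be inspected before anything is run against a real server.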