[ensembl-dev] TreeBEST-compatible species tree

Thu Dec 5 10:22:12 GMT 2019

Hi Greg

When we run our pipelines, e.g. TreeBEST, we use internal node 
identifiers instead of species names. They are shorter and less 
ambiguous to parse. The tree is generated as part of the pipeline, but 
you can adapt the 
ensembl-compara/scripts/examples/species_getSpeciesTree.pl script to 
produce it: replace the string %{n} with %{o}%{-E"*"} in that script.

"o" tells the formatter to use node IDs, and the star character must be 
added so that TreeBEST does not penalise the species. The penalty was 
initially meant to cater for the "low-coverage" mammal assemblies, but 
in our tests a few years ago it didn't seem useful any more, so we flag 
everything as "fully sequenced" by adding the star character.
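
As a rough illustration of the substitution described above, here is a 
minimal Python sketch that swaps the format string in the text of a 
script. The function name patch_format and the example line are 
hypothetical and only for illustration; the real script may use the 
format string differently, so check the result by eye.

```python
OLD_FORMAT = "%{n}"           # format string that prints species names
NEW_FORMAT = '%{o}%{-E"*"}'   # node IDs plus the star, as described above


def patch_format(script_text: str) -> str:
    """Return script_text with every %{n} replaced by the new format."""
    if OLD_FORMAT not in script_text:
        raise ValueError("format string %{n} not found in the script text")
    return script_text.replace(OLD_FORMAT, NEW_FORMAT)


# Hypothetical line standing in for the relevant part of the script.
example = "my $format = '%{n}';"
print(patch_format(example))
```

Working on a copy of the script keeps the original available for 
comparison.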

This script is not compatible with the e78 API / database schema, but I 
think you can use the latest species tree and prune it. For 
protein trees we use the NCBI taxonomy, and I don't think it has changed 
much (I remember the relationship between birds, turtles and other 
reptiles changed at some point, but I can't remember when).
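
For the pruning step, a minimal sketch in Python could look like the 
block below. It is not the Ensembl API: parse_newick, prune and 
to_newick are hypothetical helpers, and the sketch assumes a simple 
Newick string with no branch lengths, quoted labels or comments. Leaf 
labels must match the entries of the keep set exactly (including any 
trailing star). A dedicated tree library would be more robust.

```python
import re

DELIMS = {"(", ")", ",", ";"}


def parse_newick(text):
    """Parse a simple Newick string into nested (label, children) tuples."""
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if tokens[pos] != ")":
                raise ValueError("expected ')' in Newick string")
            pos += 1  # consume ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in DELIMS:
            label = tokens[pos].strip()
            pos += 1
        return (label, children)

    return parse_node()


def prune(node, keep):
    """Keep only leaves whose label is in keep; collapse one-child nodes."""
    label, children = node
    if not children:
        return node if label in keep else None
    kept = [c for c in (prune(ch, keep) for ch in children) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # an internal node with a single child is redundant
    return (label, kept)


def to_newick(node):
    """Serialise a (label, children) tuple back to Newick, without ';'."""
    label, children = node
    if children:
        return "(" + ",".join(to_newick(c) for c in children) + ")" + label
    return label


# Toy tree with made-up labels, only to show the calls.
tree = parse_newick("((a,b)x,(c,d)y)root;")
print(to_newick(prune(tree, {"a", "c", "d"})) + ";")
```

Collapsing one-child nodes means some internal node labels of the 
latest tree disappear from the pruned tree, so it is worth checking 
that the remaining labels are the ones the downstream tool expects.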

Hope this helps,
Matthieu

On 03/12/2019 11:05, Greg Slodkowicz wrote:
> Dear developers,
> I was wondering if there is a way to access the species tree that is 
> used for running TreeBEST? I have the ’normal’ Newick species tree 
> from the Compara GitHub repository but it seems like TreeBEST is quite 
> picky about the labelling of the tree nodes.
>
> What I would like to do is re-run gene-species tree reconciliation for 
> a few gene trees of interest, get bootstrap replicates for that tree 
> and then run some downstream analysis on them to get a measure of 
> uncertainty introduced by the differences in the tree. I would ideally 
> like the species tree from an archival (release 78) version of Ensembl 
> (though I imagine it hasn’t changed that much).
>
> Many thanks,
> Greg