[ensembl-dev] Metazoa GeneTrees as PhyloXML dumps

W. Augustine Dunn III wadunn83 at gmail.com
Fri Apr 20 17:14:02 BST 2012

Hello all:

I am writing to request clarification regarding the meaning of the
directory and file organization of the Metazoa GeneTree PhyloXML dumps.

The extracted directory contains 42 folders (named 000 to 041), and each
folder contains multiple XML files with rather cryptic names that start with
EMGT.  I will append the 'readme.phyloxml' text at the end of this message
for reference, but it was not very helpful in this regard.  I am naively
ASSUMING that EMGT stands for "Ensembl Metazoa Gene Tree" but would
appreciate your help in nailing that down as well.
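
For concreteness, here is a minimal sketch of how the layout described
above could be inventoried from Python.  It assumes only what is stated
here (numbered sub-folders holding files named EMGT*.xml); the function
name and the root path passed to it are hypothetical.

```python
from collections import Counter
from pathlib import Path


def inventory_dump(root):
    """Count the EMGT*.xml files found in each sub-folder of `root`.

    Assumes the layout described in this message: `root` holds numbered
    folders (000, 001, ...) and each folder holds EMGT*.xml files.
    """
    counts = Counter()
    # "*/EMGT*.xml" matches files exactly one folder below the root.
    for xml_path in sorted(Path(root).glob("*/EMGT*.xml")):
        counts[xml_path.parent.name] += 1
    return dict(counts)
```

Calling inventory_dump() on the extracted directory returns a dict that
maps each folder name to its number of tree files, which at least shows
whether the files are spread evenly or grouped in some meaningful way.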

I would like to load the information from your gene tree analyses into my
Python scripts so that I can map transcriptome-level expression
similarities (as well as other omics-type data) from Anopheles gambiae,
Aedes aegypti, and Culex quinquefasciatus onto these gene tree
relationships.  This should help me parse out meaningful correlations
between the omics results across species.  I will feel much better about
attempting this if I have better insight into the MEANING of the
organization of the trees.
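
As a rough illustration of what I mean, the sketch below groups the leaves
of one tree by source genome using only the Python standard library.  It
relies on the "Compara:genome_db_name" and "Compara:dubious_duplication"
properties named in the README appended at the end of this message, and on
the standard phyloXML namespace and `ref` attribute.  That each leaf's
<name> element holds a usable identifier is an assumption on my part, and
the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard phyloXML namespace; "phy" is just a local prefix for lookups.
NS = {"phy": "http://www.phyloxml.org"}
CLADE = "{http://www.phyloxml.org}clade"


def leaves_by_genome(xml_text):
    """Map each Compara:genome_db_name value to the leaf names carrying it."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for clade in root.iter(CLADE):
        if clade.find("phy:clade", NS) is not None:
            continue  # has child clades, so it is an internal node
        # Assumption: the leaf <name> holds the identifier of interest.
        name = clade.findtext("phy:name", default="", namespaces=NS)
        for prop in clade.findall("phy:property", NS):
            if prop.get("ref") == "Compara:genome_db_name":
                groups[(prop.text or "").strip()].append(name)
    return dict(groups)


def count_dubious_nodes(xml_text):
    """Count clades that carry a Compara:dubious_duplication property."""
    root = ET.fromstring(xml_text)
    return sum(
        1
        for clade in root.iter(CLADE)
        if any(
            prop.get("ref") == "Compara:dubious_duplication"
            for prop in clade.findall("phy:property", NS)
        )
    )
```

Both functions take the XML text of a single file, so one tree could be
read with Path(...).read_text() and passed in.  The resulting per-species
lists of leaf names are what I would then join against my expression
tables.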

I may have missed it, but scanning the supposedly relevant Ensembl
publications did not provide much help.

Can anyone shed some light onto this for me?

Thanks as always,



#### README ####

IMPORTANT: Please note you can download correlation data tables,
supported by Ensembl, via the highly customisable BioMart and
EnsMart data mining tools. See http://bacteria.ensembl.org/biomart or
http://www.ebi.ac.uk/biomart/ for more information.

Please send comments or questions to dev at ensembl.org.

PhyloXML GeneTree Flat File Dumps

PhyloXML (http://www.phyloxml.org/ and Pubmed ID 19860910) is an XML format
which is backed by an XMLSchema for validation purposes. Multiple parsers
are available for PhyloXML from numerous toolkits including BioPerl,
BioRuby, Forester (Java), Biopython and many more. The PhyloXML format also
allows for richer dumps allowing us to provide more information about a
gene tree in a single format.


The structure conforms to the standard PhyloXML structure apart from the
following rules and extensions

* A property is provided on clades called "Compara:dubious_duplication" in
order to flag nodes which have this same confidence rating in our database
* A property called "Compara:genome_db_name" is provided on every leaf node
to indicate the source of the peptide. In some cases taxonomy is a
redundant value
* All stable identifiers have the source of EnsemblGenomes even though the
true source may be a third party
* All sequences are CDNA alignments


