[ensembl-dev] Question on compara gene trees

Matthieu Muffato muffato at ebi.ac.uk
Tue Nov 19 12:33:33 GMT 2013


Hi Sunita

The bootstrap support comes from 100 iterations of a Neighbour-Joining 
reconstruction, whereas the final gene tree is a merge of 3 NJ trees and 
2 PhyML trees (cf. the documentation on our website). Therefore, you can 
find cases where a node has a bootstrap of 0 (if the merge algorithm 
selected it from the PhyML trees only).
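
To see how many nodes of a given tree are affected, something along 
these lines may help. It is only a rough Python sketch: it assumes the 
bootstrap is stored in the "B" tag of the NHX annotations, and the tree 
string and the helper name bootstrap_values are made up for 
illustration.

```python
import re
from typing import List

# Matches one NHX annotation block, e.g. [&&NHX:B=87:D=N]
NHX_BLOCK = re.compile(r"\[&&NHX:([^\]]*)\]")


def bootstrap_values(nhx: str) -> List[int]:
    """Collect the B= (bootstrap) values found in an NHX string.

    Assumption: bootstrap support is carried by the NHX "B" tag.
    """
    values = []
    for block in NHX_BLOCK.findall(nhx):
        for field in block.split(":"):
            key, _, value = field.partition("=")
            if key == "B" and value:
                values.append(int(float(value)))
    return values


# Made-up example tree, not taken from a real dump.
example_tree = (
    "((g1:0.1,g2:0.2):0.05[&&NHX:B=0],"
    "(g3:0.3,g4:0.1):0.02[&&NHX:B=87]):0.0[&&NHX:B=100];"
)
values = bootstrap_values(example_tree)
zero_support = [v for v in values if v == 0]
```

A node reported here with a value of 0 is not necessarily a poor node: 
as explained above, it may simply come from the PhyML trees, which the 
NJ bootstrap does not cover.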

The pipeline does not have the concept of scoring values for homologies. 
However, we have levels of confidence (the possible_orthologs that 
Javier mentioned).
Please be aware that we're going to revisit the classification of the 
high/low confidence orthologues in e74. I'll send an email with more 
details closer to the release.

Regarding the dumps, I realise it is a bit cumbersome to parse the two 
files at the same time to link the newick trees to their alignments.
In the EMF dumps of the alignments, we will add a "TREE" token with the 
newick / nhx strings between the SEQ sections (which define the list of 
genes), and the DATA token (which introduces the alignment).
This should be live with e74 (eg21).
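
To give an idea of how a record with the new token could be read in a 
single pass, here is a rough Python sketch. It is not the code we use, 
and it rests on several assumptions: records end with a "//" line, 
header lines start with "#", each DATA line is one alignment column 
with one character per SEQ line, and the TREE line holds the whole 
newick / nhx string. The function names and the toy record are made up 
for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def parse_emf_records(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per record with its SEQ fields, TREE string and columns."""
    seqs: List[List[str]] = []
    tree: Optional[str] = None
    columns: List[str] = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        if line == "//":
            # assumed end-of-record marker: emit the record and reset
            yield {"seqs": seqs, "tree": tree, "columns": columns}
            seqs, tree, columns, in_data = [], None, [], False
        elif in_data:
            columns.append(line)  # one alignment column per line (assumed)
        elif line.startswith("SEQ "):
            seqs.append(line.split()[1:])  # keep the fields after the token
        elif line.startswith("TREE "):
            tree = line[len("TREE "):].strip()
        elif line == "DATA":
            in_data = True


def rows_from_columns(columns: List[str]) -> List[str]:
    """Transpose alignment columns into one aligned string per sequence."""
    return ["".join(chars) for chars in zip(*columns)]


# Toy record; real SEQ lines carry more fields than shown here.
example = """\
SEQ species_a GENE1
SEQ species_b GENE2
TREE (GENE1:0.1,GENE2:0.2);
DATA
MM
A-
KK
//
""".splitlines()

records = list(parse_emf_records(example))
aligned = rows_from_columns(records[0]["columns"])
```

With the tree stored next to the SEQ and DATA sections, each record is 
self-contained, so there is no need to match entries across two files.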

Best regards,
Matthieu

On 15/11/13 22:44, Javier Herrero (TGAC) wrote:
> Dear Sunita
>
> You can check the activity on the list here:
> http://lists.ensembl.org/pipermail/dev/2013-November/thread.html#9446.
> You will see that your last two emails have been received correctly.
>
> I will try to answer some of your questions, but please have a look at
> this page:
> http://www.ensembl.org/info/genome/compara/homology_method.html where
> you will find a few details about the methodology used to build the
> phylogenetic trees.
>
> The trees are typically rooted using outgroups. This is done internally
> by TreeBeST, the software developed by Heng Li (TreeFam) and
> currently used in Ensembl. The branch lengths represent an estimate of
> the number of mutations based on the back-translated alignment, using
> the HKY model in PhyML. Therefore, the trees are phylograms. As far as I
> remember, the bootstrap support comes from 100 resampling replicates
> (i.e. Felsenstein 1985).
>
> The alignments are available in any of the other files in the same FTP
> directory. The file you have downloaded is smaller because it only lists
> the trees.
>
> As a general rule, the orthologs in Ensembl do not have a confidence
> value as of now. There is a low-confidence set of orthologs called
> “possible orthologs”, which represents the closest homolog when no other
> ortholog is found. Please refer to the aforementioned URL for more
> details on this.
>
> Kind regards
>
> Javier
>
> On 15 Nov 2013, at 20:26, Kumari, Sunita <kumari at cshl.edu
> <mailto:kumari at cshl.edu>> wrote:
>
>> Hi Ensembl team,
>>
>> I would really appreciate it if someone could answer my questions
>> quickly.
>>
>> I have not had any response so far. I am not even sure whether you are
>> getting my emails.
>>
>> Thanks much.
>>
>> Sunita
>>
>>
>>
>>
>> ========================
>>
>> From: Kumari, Sunita
>> Sent: Thursday, November 14, 2013 3:47 PM
>> To: dev at ensembl.org <mailto:dev at ensembl.org>
>> Subject: quick questions for gene trees
>>
>> Hi Ensembl compara team,
>>
>> I am using this Ensembl FTP site to get alignment files and gene
>> trees in newick format:
>>
>> ftp://ftp.ensemblgenomes.org/pub/plants/release-20/emf/ensembl-compara/homologies/
>>
>> I am using the Compara.gene_trees.20.emf.gz and
>> Compara.newick_trees.20.emf.gz files.
>>
>> I have a couple of questions. I would appreciate it if you could
>> provide me with some information.
>>
>> 1. metadata information on gene trees:
>>
>> a) Are the trees outgroup OR midpoint rooted?
>>
>> b) Is the branch length unit replacements per position, arbitrary
>> units or million years?
>>
>> c) Is the tree style a cladogram, phylogram, or phenogram?
>>
>> d) Is the bootstrap type Felsenstein 1985, aLRT SH-like branch
>> support, or Bayesian posterior probability?
>>
>>
>> 2. For alignments (Compara.gene_trees.20.emf.gz):
>>
>> Where can I get the alignment ID, i.e. the 'source DB alignment ID'?
>> e.g. What is the unique identifier for the alignment at the source
>> database?
>>
>>
>> 3. InParanoid7 provides scoring values to orthologs. e.g.
>> http://inparanoid.sbc.su.se/cgi-bin/e.cgi?species1=93&species2=98&clusters_per_page=50&.submit=Submit+Query&clusterlowerlimit=1
>>
>> Does the Compara pipeline also provide scoring values for orthologs?
>> If not, is there any plan to provide them in the next release?
>>
>> Looking forward to your reply.
>>
>> Thanks.
>>
>> Sunita
>> ________________________________________
>>
>> Sunita Kumari, PhD
>> Bioinformatics Scientist,
>> Ware Lab,
>> Cold Spring Harbor Labs,
>> Cold Spring Harbor, NY -11724
>>
>> ________________________________________
>> From: Kumari, Sunita
>> Sent: Tuesday, November 12, 2013 3:37 PM
>> To: dev at ensembl.org
>> Subject: Question on compara gene trees
>>
>> Dear Ensembl compara team,
>>
>>
>> I have a couple of questions on metadata for gene trees. I am using this
>> Ensembl FTP site to get alignment files and gene trees in newick format:
>> ftp://ftp.ensemblgenomes.org/pub/plants/release-20/emf/ensembl-compara/homologies/
>>
>> Q1. For each tree, can we get the following information? Please confirm
>> the answer given below each item.
>>
>> a) If the tree is Outgroup_OR_Midpoint rooted;
>> -----Probably Outgroup
>>
>> b) branch_length unit is "Replacements per position" OR
>> "Arbitrary units" OR "Million years";
>> ---Probably arbitrary
>>
>> c) tree style is "Cladogram" OR "Phylogram" OR "Phenogram";
>> -- Phylogram
>>
>> d) bootstrap_type is "Felsenstein 1985" OR "aLRT SH-like branch
>> support" OR "Bayesian posterior probability"
>>
>> Please provide the correct bootstrap type.
>>
>>
>> Q2. Is it possible to get conservation scores in the next Compara release
>> for Ensembl plant genomes?
>> What would be the probable timeline for the scores to become available?
>>
>>
>> Thanks.
>>
>> Sunita
>>
>> Sunita Kumari, PhD
>> Bioinformatics Scientist,
>> Ware Lab,
>> Cold Spring Harbor Labs,
>> Cold Spring Harbor, NY - 11724
>>



