[ensembl-dev] Ortholog "scores" in Compara

Sun Jul 10 18:50:38 BST 2011

Hi Patrick

The percentage identity differs depending on which protein you are looking 
at, because the percentage is calculated with respect to the length of that 
protein. For instance, if a protein is 100 amino acids long, its orthologous 
protein in another species is 200 amino acids long, and the 100 amino acids 
of the first protein perfectly match half of the second protein, the 
percentage identities will be:
- Percentage identity 1 = 100 identical matches / 100 amino acids = 100%
- Percentage identity 2 = 100 identical matches / 200 amino acids = 50%
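
The same arithmetic can be sketched in a few lines of Perl. This only 
illustrates the calculation and is not the Compara implementation; the sub 
name perc_identity and the input numbers are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage identity relative to ONE protein's length.
sub perc_identity {
    my ($identical_matches, $protein_length) = @_;
    die "protein length must be positive\n" unless $protein_length > 0;
    return 100 * $identical_matches / $protein_length;
}

my $matches = 100;    # identical positions shared by the two proteins
my %length  = (protein_1 => 100, protein_2 => 200);

# The number of matches is the same, only the denominator changes.
foreach my $name (sort keys %length) {
    printf "%s: %.0f%%\n", $name, perc_identity($matches, $length{$name});
}
```

With these numbers the loop reports 100% for protein_1 and 50% for 
protein_2, which is why the two members of one homology carry different 
values.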

Using the percentage identity to score the orthologs is not quite the right 
thing to do, although it might work fine in many cases. Orthology is an 
evolutionary concept and should be looked at using a phylogenetic tree. The 
percentage identity only refers to sequence similarity. 

Most of the time, very similar proteins in different genomes will be 
orthologous, but this is sometimes too simplistic. Imagine a gene that was 
duplicated in an ancestral species. You will have two copies (A and B) in 
each genome, and they can be very similar to one another. Strictly speaking, 
the A copies are orthologous to each other, as are the B copies, but any A 
copy is paralogous to any B copy.
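
As a toy illustration of the duplication scenario, the Perl sketch below 
labels every pair of genes. It assumes the duplication happened before the 
species split and that we already know which copy (A or B) each gene 
descends from; in practice that information comes from the phylogenetic 
tree. The gene names and the relationship sub are hypothetical.

```perl
use strict;
use warnings;

# Toy data: each gene is tagged with its species and with the ancestral
# copy (A or B) it descends from. The duplication predates the speciation.
my @genes = (
    { name => 'human_A', species => 'human', copy => 'A' },
    { name => 'human_B', species => 'human', copy => 'B' },
    { name => 'mouse_A', species => 'mouse', copy => 'A' },
    { name => 'mouse_B', species => 'mouse', copy => 'B' },
);

# Hypothetical helper: genes from the same copy diverged at the speciation
# (orthologs); genes from different copies diverged at the older
# duplication (paralogs), whichever species they are in.
sub relationship {
    my ($g1, $g2) = @_;
    return $g1->{copy} eq $g2->{copy} ? 'ortholog' : 'paralog';
}

# Print the relationship for every distinct pair of genes.
for my $i (0 .. $#genes) {
    for my $j ($i + 1 .. $#genes) {
        printf "%s vs %s: %s\n", $genes[$i]{name}, $genes[$j]{name},
            relationship($genes[$i], $genes[$j]);
    }
}
```

Note that human_A and mouse_B come out as paralogs even if their sequences 
are very similar, which is exactly the case a similarity score alone cannot 
tell apart.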

I hope this helps

Javier

On Friday 08 Jul 2011 12:57:34 Patrick Meidl wrote:
> hi all,
> 
> I'm trying to get all human <-> mouse orthologs from Compara, and would
> like to include some sort of score for the ortholog prediction in my
> dataset. reading the docs, I thought that perc_id of the Homology Member
> would be the best value to use; from the schema documentation of
> 'homology_member':
> 
> perc_id     int(10)     YES         NULL
>     defines the percentage of identity between both homologues
> 
> what puzzles me though is that for each pairwise homology, I get
> different perc_ids from the two members; the documentation suggests that
> you would expect the same value.
> 
> here is a short example code snippet:
> 
> --8<-----------------------------------------------------------------
> 
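> use strict;
> use warnings;
> use Bio::EnsEMBL::Registry;
> 
> # load the registry from the public Ensembl database server
> Bio::EnsEMBL::Registry->load_registry_from_db(
>   -host => 'ensembldb.ensembl.org',
>   -user => 'anonymous',
>   -port => 5306
> );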
> my $homology_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
>   'Multi', 'compara', 'Homology');
> my $mlss_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
>   'Multi', 'compara', 'MethodLinkSpeciesSet');
> 
> my $mlss = $mlss_adaptor->fetch_by_method_link_type_registry_aliases(
>   'ENSEMBL_ORTHOLOGUES', ['Homo sapiens', 'Mus musculus']);
> my $homologies =
>   $homology_adaptor->fetch_all_by_MethodLinkSpeciesSet($mlss);
> 
> foreach my $homology (@{ $homologies }) {
>   print "-"x30, "\n";
>   foreach my $member_attr (@{ $homology->get_all_Member_Attribute }) {
>     my ($member, $attribute) = @{ $member_attr };
>     print join("|", $member->stable_id, $attribute->perc_id), "\n";
>   }
> }
> 
> --8<-----------------------------------------------------------------
> 
> so my question is: what does the perc_id mean? is it a good measure for
> the ortholog "score"? if not, what else should I use?
> 
> cheers
> 
>     patrick

-- 
Javier Herrero, PhD
Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK