[ensembl-dev] Ortholog "scores" in Compara
Javier Herrero
jherrero at ebi.ac.uk
Sun Jul 10 18:50:38 BST 2011
Hi Patrick
The percentage identity is different depending on which protein you are looking
at as the percentage is calculated with respect to the length of that protein.
For instance, if a protein is 100 amino acids long, its orthologous protein in
another species is 200 amino acids long, and the 100 amino acids of the first
protein match perfectly half of the second protein, the percentage identities
will be:
- Perc identity 1 = 100 identical matches / 100 amino acids = 100%
- Perc identity 2 = 100 identical matches / 200 amino acids = 50%
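
The same arithmetic as a minimal Python sketch. The function name
percent_identity and its arguments are made up for illustration and are not
part of the Compara API; it only assumes that the value is the number of
identical positions divided by the length of the protein you are looking at.

```python
def percent_identity(identical_matches: int, protein_length: int) -> float:
    """Percentage identity relative to ONE protein's length (hypothetical helper)."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return 100.0 * identical_matches / protein_length


# Same 100 identical positions, seen from each member of the pair.
identical = 100
print(percent_identity(identical, 100))  # relative to the 100 aa protein
print(percent_identity(identical, 200))  # relative to the 200 aa protein
```

The number of identical matches is shared by both members; only the
denominator changes, which is why the two values differ.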
Using the percentage identity to score the orthologs is not quite the right
thing to do, although it might work fine in many cases. The orthology
relationship is an evolutionary concept and should be looked at using a
phylogenetic tree. The percentage identity only refers to sequence similarity.
Most of the time, very similar proteins in different genomes will be
orthologous, but this is sometimes too simplistic. Imagine a gene that was
duplicated in an ancestral species. You will have two copies (A and B) in
each descendant genome, and they can be very similar to one another. Strictly
speaking, the A copies will be orthologous to one another, as will the B
copies, but any A will be paralogous to any B.
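
A toy Python sketch of the duplication scenario above. It is not how Compara
infers homologies (that is done from the gene trees); the Gene type and the
relationship function are hypothetical, and the sketch assumes a single
duplication that predates the speciation, with no later losses or
duplications.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    species: str  # e.g. "human" or "mouse"
    copy: str     # "A" or "B": which duplicate lineage the gene descends from


def relationship(g1: Gene, g2: Gene) -> str:
    """Classify a pair under the single-ancestral-duplication assumption."""
    if g1 == g2:
        return "same gene"
    if g1.copy == g2.copy:
        # Same lineage: the two genes diverged at the speciation event.
        return "orthologous"
    # Different lineages: the two genes diverged at the duplication event.
    return "paralogous"


human_a, human_b = Gene("human", "A"), Gene("human", "B")
mouse_a, mouse_b = Gene("mouse", "A"), Gene("mouse", "B")

print(relationship(human_a, mouse_a))
print(relationship(human_a, mouse_b))
print(relationship(human_a, human_b))
```

Note that the classification never looks at sequence similarity: human A and
mouse B could be nearly identical and would still be paralogous here.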
I hope this helps
Javier
On Friday 08 Jul 2011 12:57:34 Patrick Meidl wrote:
> hi all,
>
> I'm trying to get all human <-> mouse orthologs from Compara, and would
> like to include some sort of score for the ortholog prediction in my
> dataset. reading the docs, I thought that perc_id of the Homology Member
> would be the best value to use; from the schema documentation of
> 'homology_member':
>
> perc_id int(10) YES NULL
> defines the percentage of identity between both homologues
>
> what puzzles me though is that for each pairwise homology, I get
> different perc_ids from the two members; the documentation suggests that
> you would expect the same value.
>
> here is a short example code snippet:
>
> --8<-----------------------------------------------------------------
>
> use strict;
> use warnings;
> use Bio::EnsEMBL::Registry;
>
> # Connect to the public Ensembl database server
> Bio::EnsEMBL::Registry->load_registry_from_db(
>     -host => 'ensembldb.ensembl.org',
>     -user => 'anonymous',
>     -port => 5306
> );
> my $homology_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
>     'Multi', 'compara', 'Homology');
> my $mlss_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
>     'Multi', 'compara', 'MethodLinkSpeciesSet');
>
> # Fetch all human <-> mouse orthologue predictions
> my $mlss = $mlss_adaptor->fetch_by_method_link_type_registry_aliases(
>     'ENSEMBL_ORTHOLOGUES', ['Homo sapiens', 'Mus musculus']);
> my $homologies =
>     $homology_adaptor->fetch_all_by_MethodLinkSpeciesSet($mlss);
>
> # Print each member's stable ID together with its own perc_id
> foreach my $homology (@{ $homologies }) {
>     print "-" x 30, "\n";
>     foreach my $member_attr (@{ $homology->get_all_Member_Attribute }) {
>         my ($member, $attribute) = @{ $member_attr };
>         print join("|", $member->stable_id, $attribute->perc_id), "\n";
>     }
> }
>
> --8<-----------------------------------------------------------------
>
> so my question is: what does the perc_id mean? is it a good measure for
> the ortholog "score"? if not, what else should I use?
>
> cheers
>
> patrick
--
Javier Herrero, PhD
Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK