[ensembl-dev] Projecting protein names and GO terms across species

alc alc at sanger.ac.uk
Thu Mar 13 09:17:08 GMT 2014


 

Dear Ensembl developers and users, 

I'm involved in some helminth genome sequencing projects in my group,
and my colleague (Eleanor Stanley) has built an in-house Compara database
for these genomes, from which we have inferred orthologs. 

I'm planning to project protein names and GO terms across species. I
know that the Ensembl team do this already, but I can't find many details
on the web of how it's done. 

I'm wondering whether my plan is very different from the Ensembl one.
Here is what I'm thinking of doing: 

(i) Projecting protein names: for each gene in a query species (e.g.
Strongyloides ratti), identify its one-to-one and many-S.ratti-to-one
orthologs in C. elegans, S. mansoni, human, D. melanogaster and zebrafish
in our local Compara database. Take a protein name from a curated
UniProt entry for one of these orthologs (taking orthologs from those
species in the order of preference given above), and project it to the
query gene. Give the projected protein name evidence code ECO:0000265 and
give the UniProt accession of the source protein. If the same protein
name is projected to several query genes, then number them with Arabic
numerals, as described in the UniProt protein naming guide
(www.uniprot.org/docs/nameprot). I couldn't find much information on the
web about how Ensembl project protein names, so I am wondering whether
this is very different? 
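
As a rough illustration of step (i), here is a minimal Python sketch. It
is only a sketch under stated assumptions, not how Ensembl does it: the
function name project_names, the input dictionaries, the ortholog type
labels, the gene and accession identifiers, and the "name N" numbering
format are all hypothetical, and the orthologs are assumed to have been
pulled out of the Compara database already.

```python
from collections import defaultdict

# Source species in the order of preference given in (i).
SPECIES_PREFERENCE = ["C. elegans", "S. mansoni", "human",
                      "D. melanogaster", "zebrafish"]
# Hypothetical labels: "many-to-one" means many query genes to one source gene.
ALLOWED_TYPES = {"one-to-one", "many-to-one"}
EVIDENCE_CODE = "ECO:0000265"


def project_names(orthologs, curated_names):
    """Project one protein name onto each query gene.

    orthologs: {query_gene: [(species, ortholog_type, uniprot_acc), ...]}
    curated_names: {uniprot_acc: protein_name}, curated UniProt entries only
    Returns {query_gene: (projected_name, evidence_code, source_acc)}.
    """
    rank = {sp: i for i, sp in enumerate(SPECIES_PREFERENCE)}
    chosen = {}
    for gene, hits in orthologs.items():
        usable = [h for h in hits
                  if h[0] in rank
                  and h[1] in ALLOWED_TYPES
                  and h[2] in curated_names]
        if usable:
            # Take the ortholog from the most preferred species.
            _, _, acc = min(usable, key=lambda h: rank[h[0]])
            chosen[gene] = (curated_names[acc], acc)

    # Group query genes by the name they received.
    by_name = defaultdict(list)
    for gene in sorted(chosen):
        by_name[chosen[gene][0]].append(gene)

    # Number a name with Arabic numerals if it went to several query genes.
    result = {}
    for name, genes in by_name.items():
        for i, gene in enumerate(genes, start=1):
            label = f"{name} {i}" if len(genes) > 1 else name
            result[gene] = (label, EVIDENCE_CODE, chosen[gene][1])
    return result


# Toy input with made-up identifiers.
example_orthologs = {
    "query_gene_1": [("human", "one-to-one", "ACC_H1"),
                     ("C. elegans", "one-to-one", "ACC_C1")],
    "query_gene_2": [("C. elegans", "many-to-one", "ACC_C1")],
    "query_gene_3": [("human", "one-to-many", "ACC_H2")],
}
example_names = {"ACC_H1": "Kinase A", "ACC_C1": "Protein X",
                 "ACC_H2": "Protein Y"}
projected = project_names(example_orthologs, example_names)
```

In this toy input the first two query genes both receive the C. elegans
name and are therefore numbered, while the third gets no name because its
only ortholog is of a type excluded in (i).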

(ii) Projecting GO terms: for each gene in a query species (e.g.
Strongyloides ratti), identify all its orthologs (one-to-one,
one-to-many, many-to-one, many-to-many) in C. elegans, S. mansoni,
human, D. melanogaster and zebrafish in our local Compara database. Take
the manually curated GO terms with evidence codes IDA/IEP/IGI/IMP/IPI
(excluding 'protein binding') from the orthologs. For each pair of
ortholog genes from two different species, find the last common
ancestors of their GO terms in the GO hierarchy, and project these
ancestral GO terms to the query gene. Give the projected GO terms
evidence code 'IEA' and give the UniProt accessions of the source
proteins. [Note: by transferring the last common ancestors of GO terms
from orthologs from two different species, I hope to be conservative and
just project GO terms that are likely to be conserved across species.] I
found some information on the web about how Ensembl project GO terms
(http://www.ebi.ac.uk/GOA/compara_go_annotations [1]), but I am not sure
whether the GO hierarchy is used at all as in my idea, or whether all GO
terms are directly projected from orthologs to the query gene? 
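
To make the "last common ancestors" step in (ii) concrete, here is a
minimal Python sketch on a toy hierarchy. It is an illustration under
assumptions, not the Ensembl method: the function names, the term names
in the toy hierarchy, the accessions, and the input layout are all
hypothetical. A real run would load the GO graph from an ontology file.
The ortholog GO terms are assumed to be already filtered to the evidence
codes listed above, with 'protein binding' removed. "Last common
ancestors" is taken to mean the most specific terms shared by both
ancestor sets, with a term counted as its own ancestor.

```python
from itertools import combinations


def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def last_common_ancestors(a, b, parents):
    """Most specific terms that are ancestors of both a and b."""
    common = ancestors(a, parents) & ancestors(b, parents)
    # Drop a common ancestor if it lies above another common ancestor.
    return {t for t in common
            if not any(t != u and t in ancestors(u, parents) for u in common)}


def project_go(ortholog_terms, parents):
    """Project GO terms onto one query gene.

    ortholog_terms: [(species, uniprot_acc, set_of_go_terms), ...]
    parents: {term: set_of_parent_terms}
    Returns {go_term: ("IEA", sorted_source_accessions)}.
    """
    projected = {}
    for (sp1, acc1, terms1), (sp2, acc2, terms2) in combinations(ortholog_terms, 2):
        if sp1 == sp2:
            continue  # only compare orthologs from two different species
        for t1 in terms1:
            for t2 in terms2:
                for lca in last_common_ancestors(t1, t2, parents):
                    projected.setdefault(lca, set()).update({acc1, acc2})
    return {term: ("IEA", sorted(accs)) for term, accs in projected.items()}


# Toy hierarchy with made-up term names (child -> parents).
toy_parents = {
    "kinase": {"catalytic"},
    "phosphatase": {"catalytic"},
    "catalytic": {"function"},
    "binding": {"function"},
}
toy_orthologs = [
    ("C. elegans", "ACC_C1", {"kinase"}),
    ("human", "ACC_H1", {"phosphatase"}),
    ("human", "ACC_H2", {"kinase"}),
]
go_projection = project_go(toy_orthologs, toy_parents)
```

One thing the sketch shows is that two unrelated terms from the same
ontology still share the root as their only common ancestor, so the
approach as written would project root terms unless they are filtered
out afterwards.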

Is this very different to what the Ensembl team are doing? I would be
very grateful to hear of any differences. 

Kind Regards, 

Avril 

Avril Coghlan 

Parasite Genomics Team 

Sanger Institute 
 

Links:
------
[1] http://www.ebi.ac.uk/GOA/compara_go_annotations