[ensembl-dev] [Ensembl-compara] Extracting orthologs proteins from multiple alignments

Thu May 30 22:46:30 BST 2013

Dear Benjamin

The answer depends on the alignment. Each of the gene clusters that you 
extract actually maps to a gene tree. In some cases, each human gene is 
in single copy in its mammal sub-tree, and the trick is to partition the 
tree into sub-trees. In other cases, it may have a lineage-specific 
paralog, and there should be several human genes in the alignment.

In your specific example, the answer comes from the gene tree: 
http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?collapse=none;db=core;g=ENSG00000020577;r=14:55033815-55260033
You can see the two mammal sub-trees. Here, you can split the list of 
genes into two groups, depending on their human orthologue. Each group 
would have a single human gene.
This is confirmed by the Orthologues view. Each human gene has 1-to-1 
relations to most of the other mammal species.
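
As a rough illustration of that split, here is a minimal Python sketch. 
It assumes the "gene|human orthologue" line format from your list quoted 
below, and uses only a handful of those lines. The function name 
split_by_human_orthologue is made up for this example.

```python
from collections import defaultdict

# A few "gene|human orthologue" lines copied from the quoted list.
# A line without "|" is one of the human genes itself.
pairs = [
    "ENSBTAG00000001468|ENSG00000020577",
    "ENSBTAG00000009785|ENSG00000179134",
    "ENSMUSG00000021838|ENSG00000020577",
    "ENSMUSG00000037513|ENSG00000179134",
    "ENSG00000020577",
    "ENSG00000179134",
]


def split_by_human_orthologue(lines):
    """Group gene IDs by the human gene they are paired with (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        gene, _, human = line.partition("|")
        # With no "|", `human` is empty and the gene is its own group key.
        groups[human or gene].append(gene)
    return dict(groups)


groups = split_by_human_orthologue(pairs)
for human, members in sorted(groups.items()):
    print(human, len(members), members)
```

This only works when every gene is paired with exactly one human gene, 
which is why the tree-based approach below is more general.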

I recommend getting the information from the gene trees directly. The 
EMF file contains all of them. You can traverse them from their root and 
look for the first non-duplication nodes that are below the last common 
ancestor of your 18 species. All the leaves of the sub-trees defined 
that way represent clusters of orthologues. There may still be cases 
with multiple human genes, but you are guaranteed that they arose after 
diverging from some other genes.
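
The traversal described above could be sketched as follows in Python. 
This is not the Compara API and it does not parse the EMF file: the Node 
class, the orthologue_clusters function and the toy tree (including the 
"chicken" outgroup and its gene ID) are all invented for illustration. 
It also assumes that you can list every species descending from the last 
common ancestor of your species (clade_species), so that "below the 
ancestor" can be tested as "all leaves belong to that clade".

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Set


@dataclass
class Node:
    """Toy gene-tree node (hypothetical, not the EMF or Compara structure)."""
    is_duplication: bool = False
    children: List["Node"] = field(default_factory=list)
    gene: Optional[str] = None      # set on leaves only
    species: Optional[str] = None   # set on leaves only

    def leaves(self) -> Iterator["Node"]:
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


def orthologue_clusters(node: Node, clade_species: Set[str]) -> List[List[str]]:
    """Return the leaves of the first non-duplication nodes found from the
    root whose leaves all belong to clade_species."""
    inside = all(leaf.species in clade_species for leaf in node.leaves())
    if inside and not node.is_duplication:
        return [[leaf.gene for leaf in node.leaves()]]
    # Duplication node, or node still above the clade: keep descending.
    clusters = []
    for child in node.children:
        clusters.extend(orthologue_clusters(child, clade_species))
    return clusters


def leaf(gene: str, species: str) -> Node:
    return Node(gene=gene, species=species)


# Invented topology: an outgroup gene, then a duplication followed by
# one speciation sub-tree per human gene.
tree = Node(children=[
    leaf("OUTGROUP_GENE", "chicken"),
    Node(is_duplication=True, children=[
        Node(children=[leaf("ENSG00000020577", "human"),
                       leaf("ENSMUSG00000021838", "mouse")]),
        Node(children=[leaf("ENSG00000179134", "human"),
                       leaf("ENSMUSG00000037513", "mouse")]),
    ]),
])

clusters = orthologue_clusters(tree, {"human", "mouse"})
for cluster in clusters:
    print(cluster)
```

On this toy tree the function returns two clusters, one per human gene. 
A duplication below the returned node would leave several genes of the 
same species in one cluster, which is the remaining case mentioned above.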

Hope this helps,
Matthieu

On 28/05/13 22:16, Benjamin Dubreuil wrote:
> Hi,
>
>
> I am trying to find a way for aligning all orthologs proteins from 18
> mammals species to human proteins.
>
> I've read the Gene Orthology/Paralogy prediction method
> <http://asia.ensembl.org/info/docs/compara/homology_method.html>.
>
> In the step 4 of this pipeline, for each cluster of genes (clustering
> based on their Blast scores), they've built multiple alignments using
> protein sequences .
>
>     /For each cluster, build a multiple alignment based on the protein
>     sequences using a combination of multiple aligners, consensified by
>     M-Coffee/
>
>
> All the alignments are available here in FASTA format
> <ftp://ftp.ensembl.org/pub/release-71/emf/ensembl-compara/homologies/Compara.71.protein.aa.fasta.gz>
> or in EMF format
> <ftp://ftp.ensembl.org/pub/release-71/emf/ensembl-compara/homologies/Compara.71.protein.aln.emf.gz>.
> I've downloaded those data. Then, I've filtered out all protein
> sequences that don't belong to one of the 18 mammals species on which
> I'm focused.
> Still, my problem remains now with the paralog proteins. I can't get
> rid of those efficiently.
>
> For one of the alignment, I have those *mammal species GI|Human
> Orthologs GI*  (if Human Orthologs exists) :
>
> ENSBTAG00000001468|ENSG00000020577
> ENSBTAG00000009785|ENSG00000179134
> ENSCAFG00000005568|ENSG00000179134
> ENSCAFG00000014940|ENSG00000020577
> ENSECAG00000010681|ENSG00000020577
> ENSECAG00000015551|ENSG00000179134
> ENSFCAG00000010739|ENSG00000179134
> ENSG00000020577
> ENSG00000179134
> ENSGGOG00000000673|ENSG00000020577
> ENSGGOG00000024088|ENSG00000179134
> ENSLAFG00000011957|ENSG00000179134
> ENSLAFG00000013245|ENSG00000020577
> ENSMODG00000011669|ENSG00000020577
> ENSMODG00000013478|ENSG00000179134
> ENSMPUG00000005514|ENSG00000020577
> ENSMPUG00000017729|ENSG00000179134
> ENSMUSG00000021838|ENSG00000020577
> ENSMUSG00000037513|ENSG00000179134
> ENSOCUG00000000343|ENSG00000179134
> ENSOCUG00000003420|ENSG00000020577
> ENSOGAG00000009678|ENSG00000020577
> ENSOGAG00000032553|ENSG00000179134
> ENSPPYG00000005832|ENSG00000020577
> ENSPPYG00000009963|ENSG00000179134
> ENSPTRG00000006364|ENSG00000020577
> ENSPTRG00000010961|ENSG00000179134
> ENSPVAG00000011118|ENSG00000179134
> ENSPVAG00000016060|ENSG00000020577
> ENSRNOG00000010489|ENSG00000020577
> ENSRNOG00000019831|ENSG00000179134
> ENSSSCG00000010706|ENSG00000179134
> ENSSSCG00000016927|ENSG00000179134
> ENSSSCG00000023408|ENSG00000020577
> ENSSTOG00000003379|ENSG00000179134
> ENSSTOG00000015276|ENSG00000020577
> ENSTTRG00000003554|ENSG00000179134
> ENSTTRG00000008280|ENSG00000020577
>
> So I don't know which Human GI I should select (/ENSG00000179134/
> or /ENSG00000020577/).
> Should I split this alignment in two ?
>
> My final goal would be to have one human protein aligned with at least
> 10 orthologs proteins from a different species out of the 18 mammals
> species, which I'm studying.
>
> So I'm trying to find the best way to do it... Any suggestions ?
> Am I mistaken in the way I am trying to achieve it ?
>
> Best.
>
> Dubreuil Benjamin