[ensembl-dev] [Ensembl-compara] Extracting orthologs proteins from multiple alignments
Matthieu Muffato
muffato at ebi.ac.uk
Thu May 30 22:46:30 BST 2013
Dear Benjamin
Depending on the alignment, the answer may be different. Each of the
gene clusters that you extract actually maps to a gene tree. In some
cases, each human gene may be in single copy in its mammal sub-tree, and
the trick is to partition the tree into sub-trees. In other cases, the
human gene may have a lineage-specific paralog, and there should then be
several human genes in the alignment.
In your specific example, the answer comes from the gene tree:
http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?collapse=none;db=core;g=ENSG00000020577;r=14:55033815-55260033
You can see the two mammal sub-trees. Here, you can split the list of
genes into two groups, depending on their human orthologue. Each group
would have a single human gene.
This is confirmed by the Orthologues view. Each human gene has 1-to-1
relations to most of the other mammal species.
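
As a rough illustration of that split, here is a minimal Python sketch
that groups "gene|human orthologue" entries like the ones in your list by
their human gene. The function name is made up for this example, and it
assumes that an entry without a "|" is a human gene itself; only a few
entries from your list are shown.

```python
from collections import defaultdict


def group_by_human_orthologue(lines):
    """Group 'gene|human_gene' entries by their human orthologue.

    Assumption: an entry without '|' is a human gene and goes into
    its own group.
    """
    groups = defaultdict(set)
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        gene, sep, human = entry.partition("|")
        if sep:
            groups[human].add(gene)
        else:
            # Bare identifier: treated as the human gene itself.
            groups[gene].add(gene)
    return dict(groups)


# A few entries copied from the list in the original message.
entries = [
    "ENSBTAG00000001468|ENSG00000020577",
    "ENSBTAG00000009785|ENSG00000179134",
    "ENSG00000020577",
    "ENSG00000179134",
    "ENSMUSG00000021838|ENSG00000020577",
    "ENSMUSG00000037513|ENSG00000179134",
]
groups = group_by_human_orthologue(entries)
for human_gene in sorted(groups):
    print(human_gene, sorted(groups[human_gene]))
```

Each resulting group can then be aligned separately. Note that this
relies on the orthologue annotation being available for every gene; the
gene tree remains the more reliable source.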
I recommend getting the information from the gene trees directly. The
EMF file contains all of them. You can traverse them from their root and
look for the first non-duplication nodes that are below the last common
ancestor of your 18 species. All the leaves of the sub-trees defined
that way represent clusters of orthologues. There may still be cases
with multiple human genes, but you are guaranteed that those copies
arose by duplication after the speciation at the root of the sub-tree.
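
The traversal described above could be sketched as follows. This is only
an illustration under stated assumptions: it does not parse the EMF file,
the Node class and the orthologue_clusters function are hypothetical, and
"below the last common ancestor" is approximated by checking that every
leaf under a node belongs to a species in clade_species (the set of
species descending from that ancestor). The toy tree and its gene names
are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical in-memory gene tree node (not an Ensembl API class)."""
    is_duplication: bool = False
    children: List["Node"] = field(default_factory=list)
    gene: Optional[str] = None      # set on leaves only
    species: Optional[str] = None   # set on leaves only

    def leaves(self) -> List["Node"]:
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def orthologue_clusters(root: Node, clade_species: Set[str]) -> List[List[str]]:
    """Walk down from the root and stop at the first non-duplication
    node whose leaves all belong to the clade of interest."""
    clusters = []
    stack = [root]
    while stack:
        node = stack.pop()
        leaves = node.leaves()
        inside = all(leaf.species in clade_species for leaf in leaves)
        if inside and not node.is_duplication:
            # The leaves of this sub-tree form one cluster of orthologues.
            clusters.append(sorted(leaf.gene for leaf in leaves))
        else:
            # Duplication node, or node above the clade: keep descending.
            # Leaves outside the clade have no children and are dropped.
            stack.extend(node.children)
    return clusters


# Toy tree: an outgroup gene, then a duplication followed by two
# speciation sub-trees, each with one human and one mouse gene.
toy = Node(children=[
    Node(gene="fish_X", species="zebrafish"),
    Node(is_duplication=True, children=[
        Node(children=[Node(gene="human_A", species="human"),
                       Node(gene="mouse_A", species="mouse")]),
        Node(children=[Node(gene="human_B", species="human"),
                       Node(gene="mouse_B", species="mouse")]),
    ]),
])
clusters = orthologue_clusters(toy, {"human", "mouse"})
```

On the toy tree this gives two clusters, one per human gene. On real
trees you would then keep only the genes from your 18 species in each
cluster.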
Hope this helps,
Matthieu
On 28/05/13 22:16, Benjamin Dubreuil wrote:
> Hi,
>
>
> I am trying to find a way to align all orthologous proteins from 18
> mammal species to human proteins.
>
> I've read the Gene Orthology/Paralogy prediction method
> <http://asia.ensembl.org/info/docs/compara/homology_method.html>.
>
> In step 4 of this pipeline, for each cluster of genes (clustered by
> their BLAST scores), a multiple alignment is built from the protein
> sequences:
>
> /For each cluster, build a multiple alignment based on the protein
> sequences using a combination of multiple aligners, consensified by
> M-Coffee/
>
>
> All the alignments are available here in FASTA format
> <ftp://ftp.ensembl.org/pub/release-71/emf/ensembl-compara/homologies/Compara.71.protein.aa.fasta.gz>
> or in EMF format
> <ftp://ftp.ensembl.org/pub/release-71/emf/ensembl-compara/homologies/Compara.71.protein.aln.emf.gz>.
> I've downloaded those data. Then I filtered out all protein
> sequences that don't belong to one of the 18 mammal species I'm
> focusing on.
> Still, my problem remains with the paralogous proteins. I can't get
> rid of them efficiently.
>
> For one of the alignments, I have these *mammal species GI|Human
> Ortholog GI* pairs (when a human ortholog exists):
>
> ENSBTAG00000001468|ENSG00000020577
> ENSBTAG00000009785|ENSG00000179134
> ENSCAFG00000005568|ENSG00000179134
> ENSCAFG00000014940|ENSG00000020577
> ENSECAG00000010681|ENSG00000020577
> ENSECAG00000015551|ENSG00000179134
> ENSFCAG00000010739|ENSG00000179134
> ENSG00000020577
> ENSG00000179134
> ENSGGOG00000000673|ENSG00000020577
> ENSGGOG00000024088|ENSG00000179134
> ENSLAFG00000011957|ENSG00000179134
> ENSLAFG00000013245|ENSG00000020577
> ENSMODG00000011669|ENSG00000020577
> ENSMODG00000013478|ENSG00000179134
> ENSMPUG00000005514|ENSG00000020577
> ENSMPUG00000017729|ENSG00000179134
> ENSMUSG00000021838|ENSG00000020577
> ENSMUSG00000037513|ENSG00000179134
> ENSOCUG00000000343|ENSG00000179134
> ENSOCUG00000003420|ENSG00000020577
> ENSOGAG00000009678|ENSG00000020577
> ENSOGAG00000032553|ENSG00000179134
> ENSPPYG00000005832|ENSG00000020577
> ENSPPYG00000009963|ENSG00000179134
> ENSPTRG00000006364|ENSG00000020577
> ENSPTRG00000010961|ENSG00000179134
> ENSPVAG00000011118|ENSG00000179134
> ENSPVAG00000016060|ENSG00000020577
> ENSRNOG00000010489|ENSG00000020577
> ENSRNOG00000019831|ENSG00000179134
> ENSSSCG00000010706|ENSG00000179134
> ENSSSCG00000016927|ENSG00000179134
> ENSSSCG00000023408|ENSG00000020577
> ENSSTOG00000003379|ENSG00000179134
> ENSSTOG00000015276|ENSG00000020577
> ENSTTRG00000003554|ENSG00000179134
> ENSTTRG00000008280|ENSG00000020577
>
> So I don't know which human GI I should select (/ENSG00000179134/
> or /ENSG00000020577/).
> Should I split this alignment in two?
>
> My final goal would be to have each human protein aligned with
> orthologous proteins from at least 10 different species out of the 18
> mammal species I'm studying.
>
> So I'm trying to find the best way to do it. Any suggestions?
> Am I going about this the wrong way?
>
> Best.
>
> Dubreuil Benjamin