[ensembl-dev] EnsEMBL compara / protein sequence alignments

Wed Nov 14 08:25:41 GMT 2012

Hi Javier,

Actually, we would like to obtain alignments of orthologous protein isoforms.
These alignments would be used as input for the SIFT program.

After testing different methods (families...), I am still not sure
about the best way to get these alignments with Ensembl.

I am thinking about starting from the genes of my query species. For
each query species gene, I would get all orthologous genes from my "hit
species" and then get all their proteins. Then I would align these
proteins to get alignments of orthologous protein isoforms.

What do you think of this possible approach?

Best regards,

Sabrina.

Javier Herrero <jherrero at ebi.ac.uk> wrote:

> BTW, we have an example script  
> (ensembl-compara/scripts/examples/families_workshop_fetchFamilyAlignment.pl)  
> that does something very similar to what you want (but just for one  
> gene).
>
> Javier
>
> On 07/11/12 11:25, Javier Herrero wrote:
>> Hi Sabrina
>>
>> It is certainly possible to get proteins from several species.
>>
>> If you are interested in getting alignments for all possible  
>> isoforms (each possible protein from each gene), you would have to  
>> use the Ensembl families. These are groups of similar proteins, but  
>> you should not assume that they are all orthologues. To infer  
>> orthology, you need a phylogenetic tree. The trees we provide are  
>> built using only one single representative protein per gene.
>>
>> In your case, I would recommend using the Ensembl families: query
>> the families using each cow protein (cow is your query species,
>> isn't it?) and dump the alignments. There are several options for
>> this. You may want to use all possible species (the families are
>> built using Ensembl and non-Ensembl proteins) or limit the
>> alignment to a subset of species. Also, in some cases you will find
>> that more than one cow protein is in the same family, so you will
>> get duplicated alignments. Is this OK?
>>
>> Kind regards
>>
>> Javier
>>
>> On 05/11/12 13:47, srodriguez wrote:
>>> Hi Javier,
>>>
>>> Thank you for your answer.
>>>
>>> Actually, I would like to obtain one file per query protein, in
>>> which it is aligned to the orthologous proteins of all the other
>>> species (and not one sequence against one sequence).
>>>
>>> ex:
>>> for protein ENSBTAP00000032594, the file containing:
>>> ENSBTAP00000032594/1-397  
>>> MDALRASAAKPPTGRKMKARAPPPPGKPATPNLHSGQRSPRRASPGPPQNQLSR
>>> ENSP00000265136/1-1261  
>>> MDAPRASAAKPPTGRKMKARAPPPPGKAATLHVHSDQKPPHDGALGSQQNLVRMK
>>> ENSSPECIE2...
>>> ENSSPECIE3...
>>>                         *** ***********************.** ::**.*:.*:  
>>> .: *. ** :
>>>
>>> Also, I would like to have one file per query protein: if a gene
>>> has several proteins, each of them should get its own file with an
>>> alignment like the one above.
>>>
>>> Do you know if it is feasible to obtain such an output with  
>>> Ensembl compara?
>>>
>>> In that case, could you please modify the script to obtain it?
>>>
>>> Thank you very much in advance.
>>>
>>> Best regards,
>>>
>>> Sabrina.
>>>
>>> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>>>
>>>> Dear Sabrina
>>>>
>>>> I have only modified the script slightly. Essentially, I have
>>>> removed some bits that were not required and cleaned up the code
>>>> a little. I have also added the possibility of specifying the
>>>> query and the target species on the command line. Lastly, I have
>>>> changed the script to output the alignments into separate
>>>> files.
>>>>
>>>> Your strategy using the ENSEMBLGENE was correct. Indeed, you get  
>>>> two proteins aligned. I believe this is what you want, isn't it?
>>>>
>>>> I have added a few comments. Let me know if there is anything
>>>> that is not clear.
>>>>
>>>> Javier
>>>>
>>>> On 22/10/12 15:58, srodriguez wrote:
>>>>> Dear all,
>>>>>
>>>>> I would like to use the Ensembl Compara API to get the protein
>>>>> sequences of a query animal aligned with homologous protein
>>>>> sequences from other species.
>>>>>
>>>>> The script would take as input the query species name (and, if
>>>>> possible, the hit species names). It would get the proteins of
>>>>> the query organism, then the homologous protein sequences, and
>>>>> then write one file per query protein sequence containing the
>>>>> alignment, with the query placed as the first sequence, followed
>>>>> by the aligned protein sequences of the other species.
>>>>>
>>>>> I was thinking about using a "homology adaptor" with
>>>>> ENSEMBLPEP, so I started a script that way, but I do not obtain
>>>>> any results with ENSEMBLPEP, and the results with ENSEMBLGENE
>>>>> contain only 2 sequences per alignment (see script attached).
>>>>>
>>>>> I also tried with "families", but sometimes I do not get the
>>>>> protein sequence for my query species in the sequence alignment,
>>>>> even though I searched using my taxon id (script no. 2 attached).
>>>>>
>>>>> Would you have a script that already performs my goal?
>>>>>
>>>>> If not, could you please help me reach my goal?
>>>>>
>>>>> Thank you very much in advance.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Sabrina.
>>>>>
>>>>>
>>>>> *******************************************
>>>>> Sabrina Rodriguez
>>>>> Bioinformatics
>>>>> Département de Génétique animale
>>>>> Unité GABI
>>>>> Domaine de Vilvert
>>>>> 78532 Jouy en josas
>>>>>
>>>>> +33 (0) 1 34 65 29 53
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list Dev at ensembl.org
>>>>> Posting guidelines and subscribe/unsubscribe info:  
>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>> -- 
>>>> Javier Herrero, PhD
>>>> Ensembl Coordinator and Ensembl Compara Project Leader
>>>> European Bioinformatics Institute (EMBL-EBI)
>>>> Wellcome Trust Genome Campus, Hinxton
>>>> Cambridge - CB10 1SD - UK
>>>>
>>>>
