[ensembl-dev] EnsEMBL compara / protein sequence alignments

Thu Nov 15 02:34:18 GMT 2012

Hi Sabrina

I see. We have been considering using the family alignments for SIFT as 
well. You can get the alignments directly from the families, so you don't 
need to re-align the proteins. Ensembl families contain very 
closely-related proteins.

The script you need to get the alignment from a family is fairly simple 
and we can help you with this. Is there any reason why you want to 
restrict your analysis to a given set of genomes? Would you consider 
using all proteins in the family (again, they are all very similar 
proteins)?
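
To illustrate what restricting a family alignment to a subset of species could look like without re-aligning anything, here is a minimal Python sketch. It is not the example script from ensembl-compara and it does not call the Ensembl API. It assumes the family alignment has already been exported as rows of (protein ID, species, gapped sequence); the function names, species labels and toy sequences are hypothetical. It keeps the query protein first, keeps only the requested species, and drops columns that are left with gaps only.

```python
from typing import List, Set, Tuple

# Hypothetical in-memory form of one family alignment row:
# (protein_id, species, gapped_sequence). All rows share the same length.
Row = Tuple[str, str, str]


def subset_alignment(rows: List[Row], query_id: str,
                     keep_species: Set[str]) -> List[Tuple[str, str]]:
    """Return (protein_id, gapped_sequence) pairs with the query first,
    followed by the rows from keep_species, minus all-gap columns."""
    query = [r for r in rows if r[0] == query_id]
    if not query:
        raise ValueError(f"{query_id} is not in this alignment")
    others = [r for r in rows if r[0] != query_id and r[1] in keep_species]
    kept = query[:1] + others

    n_cols = len(kept[0][2])
    if any(len(r[2]) != n_cols for r in kept):
        raise ValueError("aligned sequences must all have the same length")

    # Keep a column only if at least one remaining sequence has a residue.
    keep_cols = [i for i in range(n_cols)
                 if any(r[2][i] != "-" for r in kept)]
    return [(r[0], "".join(r[2][i] for i in keep_cols)) for r in kept]


def to_fasta(aligned: List[Tuple[str, str]]) -> str:
    """Format the subset as aligned FASTA, one record per protein."""
    return "".join(f">{pid}\n{seq}\n" for pid, seq in aligned)
```

Calling subset_alignment() once per query protein and writing to_fasta() of the result to a file named after that protein would give one alignment file per query protein, with the query as the first sequence.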

Javier

On 14/11/12 17:25, srodriguez wrote:
> Hi Javier,
>
> Actually, we would like to obtain alignments of orthologous protein 
> isoforms.
> These alignments would be used as input for the SIFT program.
>
> After testing different methods (families...), I am still not sure 
> about the best way to get these alignments with Ensembl...
>
> I am thinking about starting from the genes of my query species. For each 
> query species gene, I would get all orthologous genes from my "hit 
> species" and then get all their proteins. Then I would align these 
> proteins to get orthologous protein isoforms.
>
> What do you think about this possible method?
>
> Best regards,
>
> Sabrina.
>
>
> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>
>> BTW, we have an example script 
>> (ensembl-compara/scripts/examples/families_workshop_fetchFamilyAlignment.pl) 
>> that does something very similar to what you want (but just for one 
>> gene).
>>
>> Javier
>>
>> On 07/11/12 11:25, Javier Herrero wrote:
>>> Hi Sabrina
>>>
>>> It is certainly possible to get proteins from several species.
>>>
>>> If you are interested in getting alignments for all possible 
>>> isoforms (each possible protein from each gene), you would have to 
>>> use the Ensembl families. These are groups of similar proteins, but 
>>> you should not assume that they are all orthologues. To infer 
>>> orthology, you need a phylogenetic tree. The trees we provide are 
>>> built using a single representative protein per gene.
>>>
>>> In your case, I would recommend using the Ensembl families: query 
>>> the families using each cow protein (cow is your query species, isn't 
>>> it?) and dump the alignments. There are several options for this. 
>>> You may want to use all possible species (the families are built 
>>> using Ensembl and non-Ensembl proteins) or limit the alignment to a 
>>> subset of species. Also, in some cases you will find that more than 
>>> one cow protein is in the same family, so you will get duplicated 
>>> alignments. Is this OK?
>>>
>>> Kind regards
>>>
>>> Javier
>>>
>>> On 05/11/12 13:47, srodriguez wrote:
>>>> Hi Javier,
>>>>
>>>> Thank you for your answer.
>>>>
>>>> Actually, I would like to obtain 1 file per query protein, aligned 
>>>> to the orthologous proteins of all other species (and not 1 sequence 
>>>> to 1 sequence).
>>>>
>>>> ex:
>>>> for protein ENSBTAP00000032594, the file containing:
>>>> ENSBTAP00000032594/1-397 
>>>> MDALRASAAKPPTGRKMKARAPPPPGKPATPNLHSGQRSPRRASPGPPQNQLSR
>>>> ENSP00000265136/1-1261 
>>>> MDAPRASAAKPPTGRKMKARAPPPPGKAATLHVHSDQKPPHDGALGSQQNLVRMK
>>>> ENSSPECIE2...
>>>> ENSSPECIE3...
>>>>                         *** ***********************.** ::**.*:.*: 
>>>> .: *. ** :
>>>>
>>>> Also, I would like to have 1 file per query protein, so that 
>>>> if a gene has several proteins, each query protein gets its own 
>>>> file with the alignment as above.
>>>>
>>>> Do you know if it is feasible to obtain such an output with Ensembl 
>>>> compara?
>>>>
>>>> In that case, could you please modify the script to obtain it?
>>>>
>>>> Thank you very much in advance.
>>>>
>>>> Best regards,
>>>>
>>>> Sabrina.
>>>>
>>>> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>>>>
>>>>> Dear Sabrina
>>>>>
>>>>> I have only modified the script slightly. Essentially, I have 
>>>>> removed some bits that were not required and cleaned up the code a 
>>>>> little. I have also added the possibility of specifying the query 
>>>>> and the target species on the command line. Lastly, I have 
>>>>> changed the script to output the alignments into separate files.
>>>>>
>>>>> Your strategy using the ENSEMBLGENE was correct. Indeed, you get 
>>>>> two proteins aligned. I believe this is what you want, isn't it?
>>>>>
>>>>> I have added a few comments. Let me know if there is something that 
>>>>> is not clear.
>>>>>
>>>>> Javier
>>>>>
>>>>> On 22/10/12 15:58, srodriguez wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I would like to use the EnsEMBL compara API to get the protein 
>>>>>> sequences of a query animal aligned with homologous protein 
>>>>>> sequences from other species.
>>>>>>
>>>>>> The script would take as input the query species name (and, if 
>>>>>> possible, the hit species names). The script would get the 
>>>>>> proteins of the query organism, then the homologous protein 
>>>>>> sequences, and then write 1 file per query protein sequence 
>>>>>> containing the alignment of the query (placed as the first 
>>>>>> sequence) followed by the aligned protein sequences of the other 
>>>>>> species.
>>>>>>
>>>>>> I was thinking about using a "homology adaptor" with ENSEMBLPEP, 
>>>>>> so I started a script that way, but I do not obtain any results 
>>>>>> with ENSEMBLPEP and the results with ENSEMBLGENE are 2 sequences 
>>>>>> per alignment (see script attached).
>>>>>>
>>>>>> I also tried with "families", but sometimes I do not get the 
>>>>>> protein sequence for my query species in the sequence alignment 
>>>>>> even though I searched using my taxon id (script N#2 attached).
>>>>>>
>>>>>> Would you have a script that already does this?
>>>>>>
>>>>>> If not, could you please help me reach my goal?
>>>>>>
>>>>>> Thank you very much in advance.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Sabrina.
>>>>>>
>>>>>>
>>>>>> *******************************************
>>>>>> Sabrina Rodriguez
>>>>>> Bioinformatics
>>>>>> Département de Génétique animale
>>>>>> Unité GABI
>>>>>> Domaine de Vilvert
>>>>>> 78532 Jouy en josas
>>>>>>
>>>>>> +33 (0) 1 34 65 29 53
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list Dev at ensembl.org
>>>>>> Posting guidelines and subscribe/unsubscribe info: 
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>
>>>>> -- 
>>>>> Javier Herrero, PhD
>>>>> Ensembl Coordinator and Ensembl Compara Project Leader
>>>>> European Bioinformatics Institute (EMBL-EBI)
>>>>> Wellcome Trust Genome Campus, Hinxton
>>>>> Cambridge - CB10 1SD - UK
>>>>>
>>>>>
>>>
>>
>

-- 
Javier Herrero, PhD
Ensembl Coordinator and Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK