[ensembl-dev] EnsEMBL compara / protein sequence alignments

Thu Nov 15 11:06:03 GMT 2012

Hi Sabrina,

There is in fact some code in the Variation API that will fetch the appropriate Compara family for a given protein ID and then create an alignment file in the format expected by SIFT. Have a look at the dump_alignment_for_sift subroutine in the Bio::EnsEMBL::Variation::Utils::ComparaUtils module, documentation here:

http://www.ensembl.org/info/docs/Doxygen/variation-api/classBio_1_1EnsEMBL_1_1Variation_1_1Utils_1_1ComparaUtils.html

If you do need to restrict the alignment to certain species you could probably adapt this code to do so.
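
As a rough illustration of that kind of restriction (a sketch only, not part of the Variation or Compara API): assuming the family alignment has already been dumped as aligned FASTA, and assuming species can be told apart by the letters at the start of the Ensembl stable ID (ENSBTAP for cow and ENSP for human, as in the IDs quoted further down this thread), a small Python post-processing step could keep only the wanted species, put the query first, and drop columns that end up containing only gaps. The function names parse_fasta, id_prefix and restrict_alignment are hypothetical.

```python
import re


def parse_fasta(text):
    """Parse aligned FASTA text into a list of (id, sequence) pairs."""
    records, seq_id, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            # Keep only the first word of the header as the ID.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def id_prefix(seq_id):
    """Return the letters before the first digit, e.g. 'ENSBTAP' or 'ENSP'.

    Matching up to the first digit (rather than using startswith) keeps
    'ENSP' (human) from also matching IDs such as 'ENSPTRP...'.
    """
    match = re.match(r"[A-Za-z]+(?=\d)", seq_id)
    return match.group(0) if match else ""


def restrict_alignment(records, query_id, keep_prefixes):
    """Keep the query plus sequences whose ID prefix is in keep_prefixes.

    The query is placed first. Columns that contain only gaps after the
    filtering are removed. Non-Ensembl family members are dropped unless
    their prefix is listed explicitly.
    """
    keep = set(keep_prefixes)
    query = [rec for rec in records if rec[0] == query_id]
    if not query:
        raise ValueError(f"query {query_id} not found in alignment")
    others = [rec for rec in records
              if rec[0] != query_id and id_prefix(rec[0]) in keep]
    kept = query[:1] + others
    # Transpose to columns, discard all-gap columns, transpose back.
    columns = [col for col in zip(*(seq for _, seq in kept))
               if any(char != "-" for char in col)]
    return [(seq_id, "".join(col[i] for col in columns))
            for i, (seq_id, _) in enumerate(kept)]
```

Calling restrict_alignment(parse_fasta(text), "ENSBTAP00000032594", ["ENSP"]) would then return the cow query followed by any human proteins from the same family, ready to be written out one file per query protein.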

Cheers,

Graham

On 15 Nov 2012, at 02:34, Javier Herrero <jherrero at ebi.ac.uk> wrote:

> Hi Sabrina
> 
> I see. We have been considering using the family alignments for SIFT as well. You can get the alignments directly from the families, so you don't need to re-align the proteins. Ensembl families contain very closely related proteins.
> 
> The script you need to get the alignment from a family is fairly simple and we can help you with this. Is there any reason why you want to restrict your analysis to a given set of genomes? Would you consider using all proteins in the family (again, they are all very similar proteins)?
> 
> Javier
> 
> On 14/11/12 17:25, srodriguez wrote:
>> Hi Javier,
>> 
>> Actually, we would like to obtain alignments of orthologous protein isoforms.
>> These alignments would be used as input for the SIFT program.
>> 
>> After testing different methods (families...), I am still not sure about the best way to get these alignments with Ensembl...
>> 
>> I am thinking of starting from the genes of my query species. For each query-species gene, I would get all orthologous genes from my "hit species" and then get all their proteins. Then I would align these proteins to obtain alignments of orthologous protein isoforms.
>> 
>> What do you think of this possible approach?
>> 
>> Best regards,
>> 
>> Sabrina.
>> 
>> 
>> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>> 
>>> BTW, we have an example script (ensembl-compara/scripts/examples/families_workshop_fetchFamilyAlignment.pl) that does something very similar to what you want (but just for one gene).
>>> 
>>> Javier
>>> 
>>> On 07/11/12 11:25, Javier Herrero wrote:
>>>> Hi Sabrina
>>>> 
>>>> It is certainly possible to get proteins from several species.
>>>> 
>>>> If you are interested in getting alignments for all possible isoforms (each possible protein from each gene), you would have to use the Ensembl families. These are groups of similar proteins, but you should not assume that they are all orthologues. To infer orthology, you need a phylogenetic tree, and the trees we provide are built using a single representative protein per gene, so they do not cover every isoform.
>>>> 
>>>> In your case, I would recommend using the Ensembl families: query the families using each cow protein (cow is your query species, isn't it?) and dump the alignments. There are several options for this. You may want to use all possible species (the families are built using Ensembl and non-Ensembl proteins) or limit the alignment to a subset of species. Also, in some cases you will find that more than one cow protein is in the same family, so you will get duplicated alignments. Is this OK?
>>>> 
>>>> Kind regards
>>>> 
>>>> Javier
>>>> 
>>>> On 05/11/12 13:47, srodriguez wrote:
>>>>> Hi Javier,
>>>>> 
>>>>> Thank you for your answer.
>>>>> 
>>>>> Actually, I would like to obtain one file per query protein, with that protein aligned to the orthologous proteins from all the other species (rather than pairwise, one-sequence-to-one-sequence alignments).
>>>>> 
>>>>> For example,
>>>>> for protein ENSBTAP00000032594, the file would contain:
>>>>> ENSBTAP00000032594/1-397 MDALRASAAKPPTGRKMKARAPPPPGKPATPNLHSGQRSPRRASPGPPQNQLSR
>>>>> ENSP00000265136/1-1261 MDAPRASAAKPPTGRKMKARAPPPPGKAATLHVHSDQKPPHDGALGSQQNLVRMK
>>>>> ENSSPECIE2...
>>>>> ENSSPECIE3...
>>>>>                        *** ***********************.** ::**.*:.*: .: *. ** :
>>>>> 
>>>>> Also, I would like one file per query protein: if a gene has several proteins, each of them should get its own file containing an alignment like the one above.
>>>>> 
>>>>> Do you know if it is feasible to obtain such an output with Ensembl compara?
>>>>> 
>>>>> In that case, could you please modify the script to obtain it?
>>>>> 
>>>>> Thank you very much in advance.
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> Sabrina.
>>>>> 
>>>>> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>>>>> 
>>>>>> Dear Sabrina
>>>>>> 
>>>>>> I have only modified the script slightly. Essentially, I have removed some bits that were not required and cleaned up the code a little. I have also added the possibility of specifying the query and the target species on the command line. Lastly, I have changed the script to output the alignments into separate files.
>>>>>> 
>>>>>> Your strategy using the ENSEMBLGENE was correct. Indeed, you get two proteins aligned. I believe this is what you want, isn't it?
>>>>>> 
>>>>>> I have added a few comments. Let me know if there is something that is not clear.
>>>>>> 
>>>>>> Javier
>>>>>> 
>>>>>> On 22/10/12 15:58, srodriguez wrote:
>>>>>>> Dear all,
>>>>>>> 
>>>>>>> I would like to use the EnsEMBL Compara API to get the protein sequences of a query animal aligned with homologous protein sequences from other species.
>>>>>>> 
>>>>>>> The script would take as input the query species name (and, if possible, the hit species names). It would get the proteins of the query organism, then the homologous protein sequences, and then write one file per query protein containing the alignment, with the query placed first, followed by the aligned protein sequences from the other species.
>>>>>>> 
>>>>>>> I was thinking of using a "homology adaptor" with ENSEMBLPEP, so I started a script that way, but I do not obtain any results with ENSEMBLPEP, and the results with ENSEMBLGENE contain only 2 sequences per alignment (see script attached).
>>>>>>> 
>>>>>>> I also tried with "families", but sometimes I do not get the protein sequence for my query species in the sequence alignment, even though I searched using my taxon id (script no. 2 attached).
>>>>>>> 
>>>>>>> Do you have a script that already does this?
>>>>>>> 
>>>>>>> If not, could you please help me reach my goal?
>>>>>>> 
>>>>>>> Thank you very much in advance.
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> 
>>>>>>> Sabrina.
>>>>>>> 
>>>>>>> 
>>>>>>> *******************************************
>>>>>>> Sabrina Rodriguez
>>>>>>> Bioinformatics
>>>>>>> Département de Génétique animale
>>>>>>> Unité GABI
>>>>>>> Domaine de Vilvert
>>>>>>> 78532 Jouy en josas
>>>>>>> 
>>>>>>> +33 (0) 1 34 65 29 53
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Dev mailing list Dev at ensembl.org
>>>>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>> 
>>>>>> -- 
>>>>>> Javier Herrero, PhD
>>>>>> Ensembl Coordinator and Ensembl Compara Project Leader
>>>>>> European Bioinformatics Institute (EMBL-EBI)
>>>>>> Wellcome Trust Genome Campus, Hinxton
>>>>>> Cambridge - CB10 1SD - UK
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 
> 