[ensembl-dev] EnsEMBL compara / protein sequence alignments

Wed Nov 7 11:25:11 GMT 2012

Hi Sabrina

It is certainly possible to get proteins from several species.

If you are interested in getting alignments for all possible isoforms 
(each possible protein from each gene), you would have to use the 
Ensembl families. These are groups of similar proteins, but you should 
not assume that they are all orthologues. To infer orthology, you need a 
phylogenetic tree. The trees we provide are built using a single 
representative protein per gene.

In your case, I would recommend using the Ensembl families: query the 
families with each cow protein (cow is your query species, isn't it?) 
and dump the alignments. There are several options for this. You may 
want to use all possible species (the families are built using Ensembl 
and non-Ensembl proteins) or limit the alignment to a subset of species. 
Also, in some cases you will find that more than one cow protein is in 
the same family, so you will get duplicated alignments. Is this OK?
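
To make these options concrete, here is a minimal Python sketch of the 
bookkeeping only. It is not the Ensembl Perl API and not the attached 
script: the FAMILIES dictionary, the protein IDs, the species labels and 
all function names are hypothetical stand-ins for whatever a family query 
would return. It shows one alignment per query protein with the query 
listed first, an optional species subset, and the removal of columns that 
become gap-only once sequences are dropped.

```python
# Hypothetical stand-in for family query results (made-up IDs and sequences):
# family id -> list of (protein_id, species, aligned_sequence).
FAMILIES = {
    "FAM_A": [
        ("COW_P1", "bos_taurus", "MDALRAS-AAK"),
        ("COW_P2", "bos_taurus", "MDALRASTAAK"),
        ("HUM_P1", "homo_sapiens", "MDAPRAS-AAK"),
        ("MUS_P1", "mus_musculus", "MDAPKAS-AAK"),
    ],
    "FAM_B": [
        ("COW_P3", "bos_taurus", "MKT-LL"),
        ("HUM_P2", "homo_sapiens", "MKTALL"),
    ],
}


def alignments_per_query(families, query_species, target_species=None):
    """Return {query_protein_id: [(protein_id, aligned_seq), ...]}.

    The query protein is always the first row. If target_species is given,
    only members from those species (plus the query itself) are kept.
    """
    result = {}
    for members in families.values():
        for q_id, q_species, q_seq in members:
            if q_species != query_species:
                continue
            rows = [(q_id, q_seq)]
            for p_id, species, seq in members:
                if p_id == q_id:
                    continue
                if target_species is not None and species not in target_species:
                    continue
                rows.append((p_id, seq))
            result[q_id] = rows
    return result


def drop_gap_only_columns(rows):
    """Remove alignment columns in which every remaining sequence has a gap."""
    length = len(rows[0][1])
    keep = [i for i in range(length) if any(seq[i] != "-" for _, seq in rows)]
    return [(p_id, "".join(seq[i] for i in keep)) for p_id, seq in rows]


def write_fasta(rows, path):
    """Write one alignment to a FASTA file, keeping the row order."""
    with open(path, "w") as handle:
        for p_id, seq in rows:
            handle.write(f">{p_id}\n{seq}\n")
```

With this toy data, COW_P1 and COW_P2 sit in the same family, so with no 
species filter their two files hold the same four sequences and differ 
only in which one comes first. This is the duplication mentioned above. 
Calling write_fasta(rows, q_id + ".fa") for each entry of the returned 
dictionary would give one file per query protein.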

Kind regards

Javier

On 05/11/12 13:47, srodriguez wrote:
> Hi Javier,
>
> Thank you for your answer.
>
> Actually, I would like to obtain 1 file per query protein, aligned to 
> the orthologous proteins from all other species (and not 1 sequence to 
> 1 sequence).
>
> ex:
> for protein ENSBTAP00000032594, the file containing:
> ENSBTAP00000032594/1-397 MDALRASAAKPPTGRKMKARAPPPPGKPATPNLHSGQRSPRRASPGPPQNQLSR
> ENSP00000265136/1-1261 MDAPRASAAKPPTGRKMKARAPPPPGKAATLHVHSDQKPPHDGALGSQQNLVRMK
> ENSSPECIE2...
> ENSSPECIE3...
>                          *** ***********************.** ::**.*:.*: .: *. ** :
>
> Also, I would like to have 1 file per protein from the query, and if a 
> gene has several proteins, obtain each of these query proteins as a 
> separate file with the alignment as above.
>
> Do you know if it is feasible to obtain such an output with Ensembl 
> compara?
>
> In that case, could you please modify the script to obtain it?
>
> Thank you very much in advance.
>
> Best regards,
>
> Sabrina.
>
> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>
>> Dear Sabrina
>>
>> I have only modified the script slightly. Essentially, I have removed 
>> some bits that were not required and cleaned up the code a little. I 
>> have also added the possibility of specifying the query and the 
>> target species in the command line. Last, I have also changed the 
>> script to output the alignments into separate files.
>>
>> Your strategy using the ENSEMBLGENE was correct. Indeed, you get two 
>> proteins aligned. I believe this is what you want, isn't it?
>>
>> I have added a few comments. Let me know if there is something that 
>> is not clear.
>>
>> Javier
>>
>> On 22/10/12 15:58, srodriguez wrote:
>>> Dear all,
>>>
>>> I would like to use compara EnsEMBL API to get the aligned protein 
>>> sequences of a query animal with homologous protein sequences from 
>>> other species.
>>>
>>> The script would take as input the query species name (and, if 
>>> possible, the hit species names). The script would get the proteins 
>>> of the query organism, then the homologous protein sequences, and 
>>> then write 1 file per query protein sequence containing the 
>>> alignment of the query (placed as the first sequence) followed by 
>>> the aligned protein sequences from the other species.
>>>
>>> I was thinking about using a "homology adaptor" with ENSEMBLPEP, so 
>>> I started a script that way, but I do not obtain any results with 
>>> ENSEMBLPEP and the results with ENSEMBLGENE are 2 sequences per 
>>> alignment (see script attached).
>>>
>>> I also tried with "families", but sometimes I do not get the 
>>> protein sequence for my query species in the sequence alignment, even 
>>> though I searched using my taxon id (script N#2 attached).
>>>
>>> Would you have a script that already performs my goal?
>>>
>>> If not, could you please help me reach my goal?
>>>
>>> Thank you very much in advance.
>>>
>>> Best regards,
>>>
>>> Sabrina.
>>>
>>>
>>> *******************************************
>>> Sabrina Rodriguez
>>> Bioinformatics
>>> Département de Génétique animale
>>> Unité GABI
>>> Domaine de Vilvert
>>> 78532 Jouy en josas
>>>
>>> +33 (0) 1 34 65 29 53
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: 
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>> -- 
>> Javier Herrero, PhD
>> Ensembl Coordinator and Ensembl Compara Project Leader
>> European Bioinformatics Institute (EMBL-EBI)
>> Wellcome Trust Genome Campus, Hinxton
>> Cambridge - CB10 1SD - UK
>>
>>
>

