[ensembl-dev] EnsEMBL compara / protein sequence alignments
Javier Herrero
jherrero at ebi.ac.uk
Thu Nov 15 02:34:18 GMT 2012
Hi Sabrina
I see. We have been considering using the family alignments for SIFT as
well. You can get the alignments directly from the families; you don't
need to re-align the proteins. Ensembl families contain very closely
related proteins.
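As a rough illustration of reusing a family alignment rather than re-aligning, here is a minimal Python sketch. It assumes the alignment has already been fetched and is held as a plain mapping from sequence ID to gapped sequence. The function names and the FASTA-style output are hypothetical choices for the example, not part of the Ensembl API or of the example script.

```python
from typing import Dict, List, Tuple

# Assumption: sequence ID -> gapped sequence, exactly as stored in the family.
Alignment = Dict[str, str]


def query_first(alignment: Alignment, query_id: str) -> List[Tuple[str, str]]:
    """Return the alignment rows with the query protein moved to the top."""
    if query_id not in alignment:
        raise KeyError(f"{query_id} is not a member of this family alignment")
    rows = [(query_id, alignment[query_id])]
    rows += [(sid, seq) for sid, seq in alignment.items() if sid != query_id]
    return rows


def format_rows(rows: List[Tuple[str, str]]) -> str:
    """Render the rows as FASTA-style text, keeping the gap characters."""
    return "".join(f">{sid}\n{seq}\n" for sid, seq in rows)


def per_query_files(alignment: Alignment, query_ids: List[str]) -> Dict[str, str]:
    """Build one output text per query protein from a single family alignment.

    Two query proteins from the same family get the same rows, each with
    its own protein listed first; nothing is re-aligned.
    """
    return {f"{qid}.fa": format_rows(query_first(alignment, qid))
            for qid in query_ids}
```

For example, with a made-up family `{"H1": "MK-A", "Q1": "MKTA", "Q2": "M-TA"}` and queries `["Q1", "Q2"]`, this yields two texts, `Q1.fa` and `Q2.fa`, that differ only in which sequence comes first.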
The script you need to get the alignment from a family is fairly simple
and we can help you with this. Is there any reason why you want to
restrict your analysis to a given set of genomes? Would you consider
using all proteins in the family (again, they are all very similar
proteins)?
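If you do decide to restrict the alignment to a subset of genomes, one possible approach is sketched below in Python. It assumes the family alignment is a mapping from sequence ID to gapped sequence, and that species can be recognised by the stable ID prefix followed by digits only (for instance ENSBTAP for cow). Both function names are hypothetical. Once sequences are removed, some columns may contain only gaps, so the sketch drops those as well.

```python
import re
from typing import Dict, Iterable

# Assumption: sequence ID -> gapped sequence, all of equal length.
Alignment = Dict[str, str]


def restrict_to_prefixes(alignment: Alignment, prefixes: Iterable[str]) -> Alignment:
    """Keep only sequences whose ID is one of the prefixes followed by digits.

    Requiring digits after the prefix stops a short prefix from also
    matching a longer prefix that begins with the same letters.
    Non-Ensembl family members match none of the prefixes and are dropped.
    """
    patterns = [re.compile(re.escape(p) + r"\d+") for p in prefixes]
    return {sid: seq for sid, seq in alignment.items()
            if any(p.fullmatch(sid) for p in patterns)}


def drop_gap_only_columns(alignment: Alignment, gap: str = "-") -> Alignment:
    """Remove columns in which every remaining sequence has a gap."""
    if not alignment:
        return {}
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [i for i in range(length) if any(s[i] != gap for s in seqs)]
    return {sid: "".join(seq[i] for i in keep)
            for sid, seq in alignment.items()}
```

Calling `drop_gap_only_columns(restrict_to_prefixes(family, ["ENSBTAP", "ENSP"]))` would keep the cow and human proteins only. Note that the relative alignment of the retained sequences is unchanged; only empty columns disappear.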
Javier
On 14/11/12 17:25, srodriguez wrote:
> Hi Javier,
>
> Actually, we would like to obtain alignments of orthologous protein
> isoforms.
> These alignments would be used as input for the SIFT program.
>
> After testing different methods (families...), I am still not sure
> about the best way to get these alignments with Ensembl...
>
> I am thinking about starting from the genes of my query species. For
> each query species gene, I would get all orthologous genes from my "hit
> species" and then get all their proteins. Then I would align these
> proteins to get orthologous protein isoforms.
>
> What do you think about this possible method?
>
> Best regards,
>
> Sabrina.
>
>
> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>
>> BTW, we have an example script
>> (ensembl-compara/scripts/examples/families_workshop_fetchFamilyAlignment.pl)
>> that does something very similar to what you want (but just for one
>> gene).
>>
>> Javier
>>
>> On 07/11/12 11:25, Javier Herrero wrote:
>>> Hi Sabrina
>>>
>>> It is certainly possible to get proteins from several species.
>>>
>>> If you are interested in getting alignments for all possible
>>> isoforms (each possible protein from each gene), you would have to
>>> use the Ensembl families. These are groups of similar proteins, but
>>> you should not assume that they are all orthologues. To infer
>>> orthology, you need a phylogenetic tree. The trees we provide are
>>> built using only one single representative protein per gene.
>>>
>>> In your case, I would recommend using the Ensembl families: query
>>> the families using each cow (this is your query species, isn't it?)
>>> protein and dump the alignments. There are several options for this.
>>> You may want to use all possible species (the families are built
>>> using Ensembl and non-Ensembl proteins) or limit the alignment to a
>>> subset of species. Also, in some cases you will find that more than
>>> one cow protein is in the same family, so you will get duplicated
>>> alignments. Is this OK?
>>>
>>> Kind regards
>>>
>>> Javier
>>>
>>> On 05/11/12 13:47, srodriguez wrote:
>>>> Hi Javier,
>>>>
>>>> Thank you for your answer.
>>>>
>>>> Actually, I would like to obtain 1 file per query protein, aligned
>>>> to the orthologous proteins of all the other species (and not 1
>>>> sequence to 1 sequence).
>>>>
>>>> ex:
>>>> for protein ENSBTAP00000032594, the file containing:
>>>> ENSBTAP00000032594/1-397
>>>> MDALRASAAKPPTGRKMKARAPPPPGKPATPNLHSGQRSPRRASPGPPQNQLSR
>>>> ENSP00000265136/1-1261
>>>> MDAPRASAAKPPTGRKMKARAPPPPGKAATLHVHSDQKPPHDGALGSQQNLVRMK
>>>> ENSSPECIE2...
>>>> ENSSPECIE3...
>>>> *** ***********************.** ::**.*:.*:
>>>> .: *. ** :
>>>>
>>>> Also, I would like to have 1 file per protein from the query, and
>>>> if a gene has several proteins, obtain one file for each of them,
>>>> with the alignment as above.
>>>>
>>>> Do you know if it is feasible to obtain such an output with Ensembl
>>>> compara?
>>>>
>>>> In that case, could you please modify the script to obtain it?
>>>>
>>>> Thank you very much in advance.
>>>>
>>>> Best regards,
>>>>
>>>> Sabrina.
>>>>
>>>> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>>>>
>>>>> Dear Sabrina
>>>>>
>>>>> I have only modified the script slightly. Essentially, I have
>>>>> removed some bits that were not required and cleaned up the code a
>>>>> little. I have also added the possibility of specifying the query
>>>>> and the target species on the command line. Lastly, I have also
>>>>> changed the script to output the alignments into separate files.
>>>>>
>>>>> Your strategy using the ENSEMBLGENE was correct. Indeed, you get
>>>>> two proteins aligned. I believe this is what you want, isn't it?
>>>>>
>>>>> I have added a few comments. Let me know if there is something
>>>>> that is not clear.
>>>>>
>>>>> Javier
>>>>>
>>>>> On 22/10/12 15:58, srodriguez wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I would like to use the EnsEMBL Compara API to get the protein
>>>>>> sequences of a query animal aligned with homologous protein
>>>>>> sequences from other species.
>>>>>>
>>>>>> The script would take as input the query species name (and, if
>>>>>> possible, the hit species names). The script would get the
>>>>>> proteins of the query organism, then the homologous protein
>>>>>> sequences, and then write 1 file per query protein sequence
>>>>>> containing the alignment of the query (placed as the first
>>>>>> sequence) followed by the aligned protein sequences of the other
>>>>>> species.
>>>>>>
>>>>>> I was thinking about using a "homology adaptor" with ENSEMBLPEP,
>>>>>> so I started a script that way, but I do not obtain any results
>>>>>> with ENSEMBLPEP, and the results with ENSEMBLGENE are 2 sequences
>>>>>> per alignment (see script attached).
>>>>>>
>>>>>> I also tried with "families", but sometimes I do not get the
>>>>>> protein sequence for my query species in the sequence alignment
>>>>>> even though I searched using my taxon id (script no. 2 attached).
>>>>>>
>>>>>> Would you have a script that already does this?
>>>>>>
>>>>>> If not, could you please help me reach my goal?
>>>>>>
>>>>>> Thank you very much in advance.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Sabrina.
>>>>>>
>>>>>>
>>>>>> *******************************************
>>>>>> Sabrina Rodriguez
>>>>>> Bioinformatics
>>>>>> Département de Génétique animale
>>>>>> Unité GABI
>>>>>> Domaine de Vilvert
>>>>>> 78532 Jouy en josas
>>>>>>
>>>>>> +33 (0) 1 34 65 29 53
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list Dev at ensembl.org
>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>
>>>>> --
>>>>> Javier Herrero, PhD
>>>>> Ensembl Coordinator and Ensembl Compara Project Leader
>>>>> European Bioinformatics Institute (EMBL-EBI)
>>>>> Wellcome Trust Genome Campus, Hinxton
>>>>> Cambridge - CB10 1SD - UK
>>>>>
>>>>>
>>>
>>
>>
>>
>
--
Javier Herrero, PhD
Ensembl Coordinator and Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK