[ensembl-dev] EnsEMBL compara / protein sequence alignments
srodriguez
srodriguez at jouy.inra.fr
Wed Nov 14 08:25:41 GMT 2012
Hi Javier,
Actually, we would like to obtain alignments of orthologous protein isoforms.
These alignments would be used as input for the SIFT program.
After testing different methods (families...), I am still not sure
about the best way to get these alignments with Ensembl...
I am thinking about starting from the genes of my query species. For each
query species gene, I would get all orthologous genes from my "hit
species" and then get all their proteins. Then I would align these
proteins to get alignments of orthologous protein isoforms.
What do you think about this possible method?
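
To make the idea concrete, here is a minimal sketch of the grouping step in Python. It is not Ensembl API code (the Ensembl API is Perl, and nothing here calls it). The dictionaries ORTHOLOGS and PROTEINS, the gene names "query_gene_1" and "hit_gene_1", and the function names are hypothetical stand-ins for whatever the Compara queries would return. The two protein IDs and sequence fragments are the ones from my earlier example. The sketch only collects, for each query protein, the unaligned sequences that would go into one file, query first; the alignment itself would still have to be computed by a separate tool.

```python
# Hypothetical in-memory stand-ins for what the Compara queries would return.
# The gene names are made up for illustration; the two protein IDs and the
# sequence fragments come from the example earlier in this thread.
ORTHOLOGS = {  # query gene -> orthologous genes in the hit species
    "query_gene_1": ["hit_gene_1"],
}
PROTEINS = {  # gene -> {protein ID: sequence}, one entry per isoform
    "query_gene_1": {
        "ENSBTAP00000032594": "MDALRASAAKPPTGRKMKARAPPPPGKPATPNLHSGQRSPRRASPGPPQNQLSR",
    },
    "hit_gene_1": {
        "ENSP00000265136": "MDAPRASAAKPPTGRKMKARAPPPPGKAATLHVHSDQKPPHDGALGSQQNLVRMK",
    },
}


def build_alignment_inputs(orthologs, proteins):
    """Return {query protein ID: [(protein ID, sequence), ...]}.

    The query protein is always the first record; it is followed by every
    protein of every orthologous gene. The sequences are NOT aligned here.
    """
    inputs = {}
    for query_gene, hit_genes in orthologs.items():
        # All isoforms of all orthologous genes, in the order given.
        hit_records = [
            (pid, seq)
            for gene in hit_genes
            for pid, seq in proteins.get(gene, {}).items()
        ]
        # One record set per query isoform, so one output file per protein.
        for query_pid, query_seq in proteins.get(query_gene, {}).items():
            inputs[query_pid] = [(query_pid, query_seq)] + hit_records
    return inputs


def to_fasta(records):
    """Format (ID, sequence) pairs as FASTA text."""
    return "".join(f">{pid}\n{seq}\n" for pid, seq in records)


alignment_inputs = build_alignment_inputs(ORTHOLOGS, PROTEINS)
```

Each value of alignment_inputs could then be written with to_fasta() to its own file and passed to a multiple sequence aligner. With this approach, a query gene with several proteins gives several files that all share the same set of ortholog proteins.
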
Best regards,
Sabrina.
Javier Herrero <jherrero at ebi.ac.uk> wrote:
> BTW, we have an example script
> (ensembl-compara/scripts/examples/families_workshop_fetchFamilyAlignment.pl)
> that does something very similar to what you want (but just for one
> gene).
>
> Javier
>
> On 07/11/12 11:25, Javier Herrero wrote:
>> Hi Sabrina
>>
>> It is certainly possible to get proteins from several species.
>>
>> If you are interested in getting alignments for all possible
>> isoforms (each possible protein from each gene), you would have to
>> use the Ensembl families. These are groups of similar proteins, but
>> you should not assume that they are all orthologues. To infer
>> orthology, you need a phylogenetic tree. The trees we provide are
>> built using only one single representative protein per gene.
>>
>> In your case, I would recommend using the Ensembl families: query
>> the families using each cow (this is your query species, isn't it?)
>> protein and dump the alignments. There are several options for
>> this. You may want to use all possible species (the families are
>> built using Ensembl and non-Ensembl proteins) or limit the
>> alignment to a subset of species. Also, in some cases you will find
>> that more than one cow protein is in the same family, so you will
>> get duplicated alignments. Is this OK?
>>
>> Kind regards
>>
>> Javier
>>
>> On 05/11/12 13:47, srodriguez wrote:
>>> Hi Javier,
>>>
>>> Thank you for your answer.
>>>
>>> Actually, I would like to obtain 1 file per query protein, aligned
>>> to the orthologous proteins of all the other species (and not 1
>>> sequence to 1 sequence).
>>>
>>> ex:
>>> for protein ENSBTAP00000032594, the file containing:
>>> ENSBTAP00000032594/1-397
>>> MDALRASAAKPPTGRKMKARAPPPPGKPATPNLHSGQRSPRRASPGPPQNQLSR
>>> ENSP00000265136/1-1261
>>> MDAPRASAAKPPTGRKMKARAPPPPGKAATLHVHSDQKPPHDGALGSQQNLVRMK
>>> ENSSPECIE2...
>>> ENSSPECIE3...
>>> *** ***********************.** ::**.*:.*:
>>> .: *. ** :
>>>
>>> Also, I would like to have 1 file per protein from the query, and
>>> if a gene has several proteins, obtain each query protein as
>>> a separate file with the alignment as above.
>>>
>>> Do you know if it is feasible to obtain such an output with
>>> Ensembl compara?
>>>
>>> In that case, could you please modify the script to obtain it?
>>>
>>> Thank you very much in advance.
>>>
>>> Best regards,
>>>
>>> Sabrina.
>>>
>>>
>>> Javier Herrero <jherrero at ebi.ac.uk> wrote:
>>>
>>>> Dear Sabrina
>>>>
>>>> I have only modified the script slightly. Essentially, I have
>>>> removed some bits that were not required and cleaned up the code
>>>> a little. I have also added the possibility of specifying the
>>>> query and the target species on the command line. Lastly, I have
>>>> also changed the script to output the alignments into separate
>>>> files.
>>>>
>>>> Your strategy using the ENSEMBLGENE was correct. Indeed, you get
>>>> two proteins aligned. I believe this is what you want, isn't it?
>>>>
>>>> I have added a few comments. Let me know if there is something that
>>>> is not clear.
>>>>
>>>> Javier
>>>>
>>>> On 22/10/12 15:58, srodriguez wrote:
>>>>> Dear all,
>>>>>
>>>>> I would like to use the Ensembl Compara API to get the protein
>>>>> sequences of a query animal aligned with the homologous protein
>>>>> sequences from other species.
>>>>>
>>>>> The script would take as input the query species name (and, if
>>>>> possible, the hit species names). The script would get the
>>>>> proteins of the query organism, then the homologous protein
>>>>> sequences, and then write 1 file per query protein sequence
>>>>> containing the alignment, with the query placed as the first
>>>>> sequence, followed by the protein sequences of the other species.
>>>>>
>>>>> I was thinking about using a "homology adaptor" with
>>>>> ENSEMBLPEP, so I started a script that way, but I do not obtain
>>>>> any results with ENSEMBLPEP, and the results with ENSEMBLGENE are
>>>>> 2 sequences per alignment (see script attached).
>>>>>
>>>>> I also tried with "families", but sometimes I do not get the
>>>>> protein sequence for my query species in the sequence alignment,
>>>>> even though I searched using my taxon ID (script no. 2 attached).
>>>>>
>>>>> Would you have a script that already does this?
>>>>>
>>>>> If not, could you please help me reach my goal?
>>>>>
>>>>> Thank you very much in advance.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Sabrina.
>>>>>
>>>>>
>>>>> *******************************************
>>>>> Sabrina Rodriguez
>>>>> Bioinformatics
>>>>> Département de Génétique animale
>>>>> Unité GABI
>>>>> Domaine de Vilvert
>>>>> 78532 Jouy en josas
>>>>>
>>>>> +33 (0) 1 34 65 29 53
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list Dev at ensembl.org
>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>> --
>>>> Javier Herrero, PhD
>>>> Ensembl Coordinator and Ensembl Compara Project Leader
>>>> European Bioinformatics Institute (EMBL-EBI)
>>>> Wellcome Trust Genome Campus, Hinxton
>>>> Cambridge - CB10 1SD - UK
>>>>
>>>>
>>
>
*******************************************
Sabrina Rodriguez
Bioinformatics
Département de Génétique animale
Unité GABI
Domaine de Vilvert
78532 Jouy en josas
+33 (0) 1 34 65 29 53