[ensembl-dev] extracting human mouse between species out paralogs

Stamboulian, Mouses Hrag mstambou at indiana.edu
Wed Oct 14 16:16:38 BST 2015


Dear Matthieu, 

Thank you for the script. However, I ran it a couple of times overnight and it is not printing any results. I also tried writing the results to an external file, but that did not work either. Attached are both your version of the code (btw_species_paralogues.pl) and my version (btw_species_paralogues_file_output.pl); I ran both. Whenever I run the script, I receive this output:

[mstambou at silo perl_scripts]$ perl original_btw_species_paralogues.pl
Loaded 32385 mus_musculus gene names
Loaded 43782 homo_sapiens gene names
Loaded 26429 orthologies
Loaded 20667 protein-trees
[1/20667 trees processed]
[2/20667 trees processed]
[3/20667 trees processed]
[4/20667 trees processed]
[5/20667 trees processed]
[6/20667 trees processed]
[7/20667 trees processed]
[8/20667 trees processed]
[9/20667 trees processed]

Please note that the file 'original_btw_species_paralogues.pl' is the same as 'btw_species_paralogues.pl'; I just renamed it. The script looks fine and should print all the paralogues, but it does not, and I don't understand why. Thank you.
________________________________________
From: dev-bounces at ensembl.org <dev-bounces at ensembl.org> on behalf of Matthieu Muffato <muffato at ebi.ac.uk>
Sent: Friday, October 9, 2015 9:17 AM
To: Ensembl developers list
Subject: Re: [ensembl-dev] extracting human mouse between species out   paralogs

Messages like "[10100/20667]" are printed on stderr. If you redirect the
output to a file, the file will be clean.
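
As a minimal sketch of this separation (not the attached script itself): results go to stdout and progress counters go to stderr, so redirecting stdout captures only the results. The `process_tree` and `run` helpers and the toy result line are hypothetical, and enabling autoflush is an assumption added here so that lines appear as soon as they are produced.

```perl
use strict;
use warnings;
use IO::Handle;

# Flush both streams immediately so results and progress appear as they are produced.
STDOUT->autoflush(1);
STDERR->autoflush(1);

# Hypothetical stand-in for the per-tree work: returns the result lines for one tree.
# Here, even-numbered trees yield one made-up result line and odd-numbered trees none.
sub process_tree {
    my ($tree_id) = @_;
    return $tree_id % 2 ? () : ("TREE$tree_id\tmouse_protein\thuman_protein");
}

# Write results to $out and progress messages to $err.
sub run {
    my ($n_trees, $out, $err) = @_;
    for my $i (1 .. $n_trees) {
        print {$out} "$_\n" for process_tree($i);          # results  -> stdout
        print {$err} "[$i/$n_trees trees processed]\n";    # progress -> stderr
    }
}

run(4, \*STDOUT, \*STDERR);
```

Running such a script as `perl script.pl > paralogues.tsv` would leave the progress lines on the terminal and write only the result lines to the file; adding `2> progress.log` would capture the progress lines separately.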

Yes, the script will take some time to run because it has to parse all
the gene-trees.

I attach a new version of the script that also prints the gene stable
IDs, the taxonomy level of the duplication, and the confidence score
(species-intersection score) of the duplication nodes.
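
The attachment is scrubbed from the archive, so the following is only an illustrative sketch of the general idea behind such a script: within one gene tree, take every mouse-human pair and keep those that are not known orthologues. It uses toy in-memory data rather than the Ensembl API; the `between_species_paralogues` helper, the data layout, and all identifiers are made up for illustration. The taxonomy level and confidence score are omitted because they depend on the tree structure.

```perl
use strict;
use warnings;

# Return every (mouse, human) pair in one tree that is not a known orthologue.
# $members:     array ref of { species => ..., gene => ..., protein => ... }
# $orthologues: hash ref keyed by "mouse_gene|human_gene"
sub between_species_paralogues {
    my ($members, $orthologues) = @_;
    my @mouse = grep { $_->{species} eq 'mus_musculus' } @$members;
    my @human = grep { $_->{species} eq 'homo_sapiens' } @$members;
    my @pairs;
    for my $m (@mouse) {
        for my $h (@human) {
            # Skip pairs already recorded as orthologues.
            next if $orthologues->{"$m->{gene}|$h->{gene}"};
            push @pairs, [ $m->{gene}, $m->{protein}, $h->{gene}, $h->{protein} ];
        }
    }
    return @pairs;
}

# Toy tree with two mouse and two human members (identifiers are invented).
my @tree = (
    { species => 'mus_musculus', gene => 'mG1', protein => 'mP1' },
    { species => 'mus_musculus', gene => 'mG2', protein => 'mP2' },
    { species => 'homo_sapiens', gene => 'hG1', protein => 'hP1' },
    { species => 'homo_sapiens', gene => 'hG2', protein => 'hP2' },
);
my %orthologues = ( 'mG1|hG1' => 1, 'mG2|hG2' => 1 );

# One tab-separated line per pair: tree ID, mouse gene, mouse protein, human gene, human protein.
print join("\t", 'TREE_X', @$_), "\n"
    for between_species_paralogues(\@tree, \%orthologues);
```

With this toy input, the two orthologous pairs are skipped and the two remaining cross-species pairs (mG1 with hG2, and mG2 with hG1) are printed.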

Matthieu

On 08/10/15 23:31, Stamboulian, Mouses Hrag wrote:
> Dear Matthieu,
>
> Thanks a lot for the script. When I ran it I started to get output such as this (please find below). Also, it is taking a very long time to print output; the script has been running for hours. Is that normal?
>
> I have some questions about the script's output. I did not really understand what the lines like [10100/20667] mean. I'm assuming these are for the orthologous gene pairs;
> am I safe to assume that? As for the other kind of output, e.g. ENSGT00530000064989     ENSMUSP00000100418      ENSP00000439668, I'm assuming the first field is the gene tree ID, the second is the paralogous protein ID in mouse, and the third is the paralogous protein ID in human. Is that right?
>
> One last thing: could we modify the script so that the output is displayed in this format: Tree ID       Mouse_Gene_ID        Mouse_protein_ID         Human_gene_ID      Human_protein ID   and paralogy_Confidence?
>
> If the paralogy confidence cannot be inferred, then that's fine.
>
> Thanks a lot.
>
> [10100/20667]
> [10200/20667]
> [10300/20667]
> [10400/20667]
> [10500/20667]
> [10600/20667]
> [10700/20667]
> [10800/20667]
> [10900/20667]
> [11000/20667]
> [11100/20667]
> [11200/20667]
> [11300/20667]
> ENSGT00770000120830     ENSMUSP00000051355      ENSP00000476742
> [11400/20667]
> [11500/20667]
> ENSGT00530000064989     ENSMUSP00000100418      ENSP00000439668
> ENSGT00530000064989     ENSMUSP00000100418      ENSP00000446309
> ENSGT00530000064989     ENSMUSP00000143226      ENSP00000439668
> ENSGT00530000064989     ENSMUSP00000143226      ENSP00000446309
> ENSGT00530000064989     ENSMUSP00000136007      ENSP00000439668
>
> ________________________________________
> From: dev-bounces at ensembl.org <dev-bounces at ensembl.org> on behalf of Matthieu Muffato <muffato at ebi.ac.uk>
> Sent: Thursday, October 8, 2015 1:44 PM
> To: Ensembl developers list
> Subject: Re: [ensembl-dev] extracting human mouse between species out   paralogs
>
> Dear Mouses,
>
> Your description of BioMart and the API is correct but it doesn't work
> because we don't store between-species paralogs in the databases.
>
> A solution is to write a (more complicated) script that goes through all
> the gene-trees and selects from them all the human-mouse pairs that are
> not orthologues. I attach a script that should work. Let me know if you
> find any issues.
>
> Regards,
> Matthieu, Ensembl Compara
>
> On 08/10/15 00:29, Stamboulian, Mouses Hrag wrote:
>> Hi,
>>
>>
>> I'm trying to extract the human-mouse between-species out-paralogs. I
>> tried the GUI through Ensembl BioMart but was not able to
>> extract the needed data: when I select Homo sapiens as my dataset
>> and then select homologs as the attribute, the paralogs section
>> only offers the human paralogs (i.e. within-species
>> paralogs); I found no options for between-species paralogs.
>>
>>
>> Furthermore, I tried extracting the data through the Perl API by
>> modifying this script (please find below). I
>> changed the parameter at the bolded line in the code, in the
>> fetch_by_method_link_type_registry_aliases() call, replacing
>> 'ENSEMBL_ORTHOLOGUES' with 'ENSEMBL_PARALOGUES' or 'ENSEMBLE_HOMOLOGUES',
>> hoping it would return paralogs or all the homologs in general. However,
>> that did not work. I could not find what other parameters I could
>> pass to this method instead of 'ENSEMBL_ORTHOLOGUES', as I could not
>> find them in your documentation here:
>> http://www.ensembl.org/info/docs/Doxygen/compara-api/classBio_1_1EnsEMBL_1_1Compara_1_1DBSQL_1_1MethodLinkSpeciesSetAdaptor.html#aeb42739559569b62ee3bfab6da764976?
>>
>>
>> My question is: what would a script to retrieve such data look like? Help
>> please. Thank you.
>>
>>
>> use strict;
>> use warnings;
>>
>> use Bio::EnsEMBL::Registry;
>>
>> ## Load the registry automatically
>> my $reg = "Bio::EnsEMBL::Registry";
>> $reg->load_registry_from_url('mysql://anonymous@ensembldb.ensembl.org');
>>
>> ## Get the compara mlss adaptor
>> my $mlss_adaptor = $reg->get_adaptor("Multi", "compara",
>> "MethodLinkSpeciesSet");
>>
>> ## Get the compara homology adaptor
>> my $homology_adaptor = $reg->get_adaptor("Multi", "compara", "Homology");
>>
>> ## Species definition
>> my $species1 = 'human';
>> my $species2 = 'mouse';
>>
>> ## Get the MethodLinkSpeciesSet object describing the orthology between the two species
>> ## (this is the line that was bolded in the original message)
>> my $this_mlss =
>> $mlss_adaptor->fetch_by_method_link_type_registry_aliases('ENSEMBL_ORTHOLOGUES',
>> [$species1, $species2]);
>>
>> ## Get all the homologues
>> my $all_homologies =
>> $homology_adaptor->fetch_all_by_MethodLinkSpeciesSet($this_mlss);
>>
>> ## For each homology
>> my $count = 0;
>> foreach my $this_homology (@{$all_homologies}) {
>>
>>     ## only keeps the one2one
>>     if ($this_homology->description() eq 'ortholog_one2one') {
>>       $count++;
>>     }
>> }
>>
>> print "There are $count 1-to-1 orthologues between $species1 and $species2\n";
>>
>> ## Alternative (shorter) version
>> my $all_one2one =
>> $homology_adaptor->fetch_all_by_MethodLinkSpeciesSet($this_mlss,
>> -orthology_type => 'ortholog_one2one');
>>
>> print "It should be the same number as: ", scalar(@{$all_one2one}), "\n";
>>
>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>
> --
> Matthieu Muffato, Ph.D.
> Ensembl Compara and TreeFam Project Leader
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus, Hinxton
> Cambridge, CB10 1SD, United Kingdom
> Room  A3-145
> Phone + 44 (0) 1223 49 4631
> Fax   + 44 (0) 1223 49 4468
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: btw_species_paralogues.pl
Type: application/octet-stream
Size: 3636 bytes
Desc: btw_species_paralogues.pl
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151014/e9382e7f/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: btw_species_paralogues_file_output.pl
Type: application/octet-stream
Size: 3920 bytes
Desc: btw_species_paralogues_file_output.pl
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151014/e9382e7f/attachment-0001.obj>

