[ensembl-dev] Ensembl Compara Member Objects Associated With Multiple Single-Member Family Stable IDs

usman Ali pmasuaar200 at gmail.com
Wed Jun 13 14:07:36 BST 2012


Sir, I would like to clarify my understanding of gene trees. What exactly is a gene tree?

On Tue, Jun 12, 2012 at 11:15 PM, Christopher Kelly <cpjkelly at gmail.com> wrote:

> Thank you very much for the timely and helpful response, Javier.
>
> Cheers,
>
> Chris
>
> On 2012-06-11, at 10:03 PM, Javier Herrero wrote:
>
> Dear Chris
>
> First of all, it seems your script is correct. See
> http://www.ensembl.org/Homo_sapiens/Gene/Family?g=ENSG00000127054;r=1:1246965-1260071
> for the list of families on the web.
>
> We expect some genes to be part of several families. This is because we
> cluster the proteins, not the genes themselves. In the case of this gene,
> there are many different alternative transcripts that have been annotated.
> The richness of annotation comes from Havana, the group at the Sanger doing
> the manual annotation of the genome. In this case, a few of the transcripts
> appear as singletons simply because the resulting protein is not similar
> enough to any of the other proteins used in this analysis. The set of proteins
> we use includes all the Ensembl proteins plus all the metazoan proteins in
> UniProt.
>
> If you wish to remove the singletons, you can test the number of protein
> members in the family. Here is an untested snippet that would do this:
>
> my $num = grep { $_->source_name eq "ENSEMBLPEP" } @{ $family->get_all_Members };
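
A slightly fuller sketch of the same idea, for anyone who wants to try it without a database connection. It assumes only that each member object responds to source_name, as in the snippet above. The MockMember package, the helper names count_ensembl_peptides and is_singleton, and the example members are invented for illustration and are not part of the Ensembl API. Treating "at most one ENSEMBLPEP member" as a singleton is also an assumption; it deliberately ignores UniProt members.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a compara member object; only source_name is modelled.
package MockMember;
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub source_name { return $_[0]->{source_name}; }

package main;

# Count the Ensembl peptide members in a listref of members.
# grep in scalar context returns the number of matching elements.
sub count_ensembl_peptides {
    my ($members) = @_;
    my $num = grep { $_->source_name eq 'ENSEMBLPEP' } @$members;
    return $num;
}

# Assumption: a family counts as a singleton when it holds at most one
# Ensembl peptide. UniProt members are not considered here.
sub is_singleton {
    my ($members) = @_;
    return count_ensembl_peptides($members) <= 1 ? 1 : 0;
}

# Made-up member lists standing in for $family->get_all_Members.
my $single = [
    MockMember->new(source_name => 'ENSEMBLPEP'),
    MockMember->new(source_name => 'ENSEMBLGENE'),
];
my $multi = [
    MockMember->new(source_name => 'ENSEMBLPEP'),
    MockMember->new(source_name => 'ENSEMBLPEP'),
    MockMember->new(source_name => 'Uniprot/SWISSPROT'),
    MockMember->new(source_name => 'ENSEMBLGENE'),
];

print "single: ", is_singleton($single), "\n";
print "multi: ",  is_singleton($multi),  "\n";
```

In the real script the same check would presumably be called as is_singleton($single_family->get_all_Members) inside the loop over families, skipping the print when it returns true.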
>
> I hope this helps
>
> Javier
>
>
> ------------------------------
> *From:* Christopher Kelly <cpjkelly at gmail.com>
> *Sent:* Tue Jun 12 01:36:57 GMT+01:00 2012
> *To:* Ensembl developers list <dev at ensembl.org>
> *Subject:* [ensembl-dev] Ensembl Compara Member Objects Associated With
> Multiple Single-Member Family Stable IDs
>
> Hello all,
>
> I have written a script to fetch all members in the ensembl compara 67 database using the member_adaptor->fetch_all_by_source_genome_db_id() method.
>
> Next the script fetches all families associated with each member that was fetched in the previous step, using family_adaptor->fetch_all_by_Member().
>
> When invoked, the script is passed the genome ids of each species to fetch family/member data for, and uses ENSEMBLGENE as the source for member_adaptor->fetch_all_by_source_genome_db_id.
>
> The script outputs list files for each scaffold, with the stable id of each fetched member placed in its respective scaffold file. In addition, the script outputs a file (families file) containing the stable ids of each family next to the members associated with each family, with singleton members simply placed beside their own stable_id.
>
> The problem with this script (and where I hope someone can provide guidance) is that, for some genes, the families file lists several families. For such genes, the first family listed usually contains members other than that gene, but the subsequent listings are family stable ids that link to families containing only that gene. For example, the following is an excerpt from the families file generated from the homo_sapiens genome_db:
>
>
> ENSG00000160087	ENSFM00580000910228
> ENSG00000162572	ENSFM00250000001147
> ENSG00000131584	ENSFM00250000000926
> ENSG00000131584	ENSFM00550000746913
> ENSG00000131584	ENSFM00550000756970
> ENSG00000169972	ENSFM00250000006052
> ENSG00000127054	ENSFM00280000058718
> ENSG00000127054	ENSFM00610000969020
> ENSG00000127054	ENSFM00610000969699
> ENSG00000127054	ENSFM00610000969700
> ENSG00000127054	ENSFM00610000969886
> ENSG00000127054	ENSFM00610000973240
> ENSG00000127054	ENSFM00610000973241
> ENSG00000127054	ENSFM00560000783793
> ENSG00000127054	ENSFM00610000973242
> ENSG00000127054	ENSFM00610000969761
> ENSG00000127054	ENSFM00610000973243
>
> Notice that the gene ENSG00000127054 is associated with multiple unique families. When entered into the ensembl genome browser, the first family entry for this gene returns a family with multiple orthologous members (which is expected). However, the subsequent family entries for this same gene, when entered into the ensembl genome browser, all return a family whose only member is ENSG00000127054. In fact, as far as I can tell, each of these family entries is identical, aside from stable id.
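
One quick way to see which genes are affected is to count family stable ids per gene in the families file. This is a minimal sketch, assuming the two-column tab-separated layout shown in the excerpt; the helper name families_per_gene is made up for this example, and the input is a handful of lines copied from the excerpt rather than the full file.

```perl
use strict;
use warnings;

# Build a hash of gene stable id => listref of family stable ids
# from lines of the form "<gene_id>\t<family_id>".
sub families_per_gene {
    my (@lines) = @_;
    my %families;
    foreach my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines
        my ($gene, $family) = split /\t/, $line;
        push @{ $families{$gene} }, $family;
    }
    return \%families;
}

# A few lines taken from the families file excerpt.
my @excerpt = (
    "ENSG00000160087\tENSFM00580000910228",
    "ENSG00000131584\tENSFM00250000000926",
    "ENSG00000131584\tENSFM00550000746913",
    "ENSG00000131584\tENSFM00550000756970",
    "ENSG00000169972\tENSFM00250000006052",
);

my $families = families_per_gene(@excerpt);

# Report only the genes that are listed under more than one family.
foreach my $gene (sort keys %$families) {
    my $n = scalar @{ $families->{$gene} };
    print "$gene\t$n\n" if $n > 1;
}
```

This only counts rows; it cannot tell whether the extra families are singletons, which still needs a look at the family members themselves.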
>
> Would anyone be able to provide me with insight regarding:
>
> a) The reason for multiple entries for singleton families (I am assuming that the problem is not due to different assemblies since I am searching only ensembl compara release 67).
>
> and/or
>
> b) How I can filter out these singleton families reliably using the API.
>
> Thanks in advance,
>
> Chris Kelly
>
>
> Here is my script, trimmed of pieces irrelevant to this question:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use Bio::EnsEMBL::Registry;
> use Getopt::Long; # for GetOptions used for command line args
> use Scalar::Util qw(openhandle); # for openhandle()
>
> my @genome_db_id_list = ();
> my $output_file = "families.table"; #output file
>
> #Get command line arguments
> GetOptions("id=s" => \@genome_db_id_list, "output=s" => \$output_file,);
>
> unless (@genome_db_id_list){
>     die "No genome_db_id specified. Exiting...";
> }
> my $registry='Bio::EnsEMBL::Registry';
>
> print "\n\nLoading registry...";
>
> $registry->load_registry_from_db(
>     -host => 'ensembldb.ensembl.org',
>     -user => 'anonymous',
>     -port => '5306'); # add -verbose => '1' for more verbose output
>
> print "Done";
>
> #get member adaptor
> my $member_adaptor = $registry->get_adaptor('Multi', 'compara', 'Member');
>
> my $family_adaptor = $registry->get_adaptor('Multi', 'compara', 'Family');
>
> my @member_list;
> my $count = 0; #number of members processed so far
>
> #outside for-loop to iterate body of program for each genome_id specified on the command line
> foreach my $species_genome_db_id (@genome_db_id_list){
>
>     #Fetch all members of given species (specified by genome_db_id) from given source.
>     #Source options are: 'ENSEMBLGENE', 'ENSEMBLPEP', 'Uniprot/SPTREMBL',
>     #'Uniprot/SWISSPROT', 'ENSEMBLTRANS', 'EXTERNALCDS'.
>     #Each species has a unique genome_db_id in the current ensembl compara db version.
>     sub get_members_list {
>
>         my($source, $genome_db_id, @members) = @_;
>
>         #fetch members list - returns listref of members
>         my $new_members_ref = $member_adaptor->fetch_all_by_source_genome_db_id("$source", "$genome_db_id");
>
>         #dereference members list ref
>         my @new_members = @$new_members_ref;
>
>         #join new_members list to the list of members
>         push(@members, @new_members);
>         @members;
>     }
>
>
>     #Get members from all sources for the given genome_db_id (denoting a specific species)
>     @member_list = get_members_list('ENSEMBLGENE', $species_genome_db_id, @member_list);
>
>     #open the output file for overwriting
>     open (OUTPUT, ">$output_file") or die "Could not open: $!";
>
>     #initialise @families_list as an empty list
>     my @families_list=();
>
>     #Fetch all families that each member belongs to, as well as
>     #create .lst files for each chromosome, and print member stable ids
>     #and strand orientation to .lst files
>     foreach my $single_member (@member_list) {
>
>         #get member chr name and strand for use by subsequent commands
>         my $chr_name = $single_member->chr_name();
>         my $chr_strand;
>
>         if ($single_member->chr_strand() == -1){
>
>             $chr_strand = '-';
>         }
>         elsif ($single_member->chr_strand() == 1) {
>
>             $chr_strand = '+';
>         }
>
>         else{
>
>             $chr_strand = '';
>         }
>
>         #get stable id for member
>         my $mem_stab_id = $single_member->stable_id();
>
>         #construct .lst file names of the format <id>_<chr>.lst
>         my $lst_file_name = "$species_genome_db_id"."_"."$chr_name".".lst";
>
>         open(my $fh, '>>', $lst_file_name) or die "Could not open $lst_file_name: $!";
>
>         print $fh ("$mem_stab_id"."$chr_strand"."\n");
>         close($fh);
>
>         my $new_families_list_ref = $family_adaptor->fetch_all_by_Member($single_member);
>         my @new_families_list = @$new_families_list_ref;
>
>         if (@new_families_list){
>
>             #print member stable id next to each corresponding family stable id
>             foreach my $single_family (@new_families_list){
>                 my $fam_stab_id = $single_family->stable_id();
>                 print OUTPUT "$mem_stab_id\t$fam_stab_id\n";
>             }
>         }
>         else{
>             print OUTPUT "$mem_stab_id\t$mem_stab_id\n";
>         }
>
>         $count++;
>
>     }
> }
>
>
>
>
> ------------------------------
>
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>