[ensembl-dev] Filtering ncRNAs from a list of Member objects

Christopher Kelly cpjkelly at gmail.com
Tue Jul 10 19:06:06 BST 2012


Hello all,

I am using a script to fetch all Members associated with a given genome db id, and to fetch the protein family (if any) that each member belongs to.

This works fine. However, the comparative analysis program that uses the output of this script is producing less reliable results than would be desirable, since the human annotation seems to contain many ncRNA genes whose orthologues have not yet been identified in the annotations of many other species. 

In order to improve the accuracy of the analysis program, I would like to be able to filter out all ncRNA genes from my script output. 

The script usually fetches from ENSEMBLGENE. I have tried fetching from ENSEMBLPEP instead in order to filter out ncRNAs, but this still reduces the quality of the output for analysis purposes.

Having sifted through a good deal of the Ensembl and Ensembl Compara Doxygen documentation, I have yet to find an accurate method that would do this for me.

Is there an accurate API function/method for filtering ncRNA genes from a list of member or gene objects?


Thanks in advance,

Chris Kelly



Here is the relevant section of the script code:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => '5306'); # add -verbose => '1' for more verbose output  
                      # add -db_version => 'version' for a specific ensembl db version, otherwise script will search most current version

#get member adaptor
my $member_adaptor = $registry->get_adaptor('Multi', 'compara', 'Member');

my $family_adaptor = $registry->get_adaptor('Multi', 'compara', 'Family');

my @member_list;

#outside for-loop to iterate body of program for each id specified on the command line
foreach my $species_genome_db_id (@genome_db_id_list){
    
    @member_list = ();
    my $file_path = "$current_directory/$species_genome_db_id";
    mkdir $file_path, 0777;
    
    #Fetch all members of given species (specified by genome_db_id) from given source.
    #Source options are: 'ENSEMBLGENE', 'ENSEMBLPEP', 'Uniprot/SPTREMBL', 
    #'Uniprot/SWISSPROT', 'ENSEMBLTRANS', 'EXTERNALCDS'.
    #Each species has a unique genome_db_id in the current ensembl compara db version.
    sub get_members_list {
        
        my($source, $genome_db_id, @members) = @_;
    
        #fetch members list - returns listref of members
        my $new_members_ref = $member_adaptor->fetch_all_by_source_genome_db_id($source, $genome_db_id);
    
        #append the new members to the list passed in and return the combined list
        push(@members, @$new_members_ref);
        return @members;
    }   
    
    #
    #Get members from all sources for the given genome_db_id (denoting a specific species)
    #
    #@member_list = get_members_list('ENSEMBLGENE', $species_genome_db_id, @member_list);
    #print "ENSEMBLGENE members fetched\n";
    @member_list = get_members_list('ENSEMBLPEP', $species_genome_db_id, @member_list);
    print "ENSEMBLPEP members fetched\n";
    #@member_list = get_members_list('Uniprot/SPTREMBL', $species_genome_db_id, @member_list);
    #print "Uniprot/SPTREMBL members fetched\n";
    #@member_list = get_members_list('Uniprot/SWISSPROT', $species_genome_db_id, @member_list);
    #print "Uniprot/SWISSPROT members fetched\n";
    #@member_list = get_members_list('ENSEMBLTRANS', $species_genome_db_id, @member_list);
    #print "ENSEMBLTRANS members fetched\n";
    #@member_list = get_members_list('EXTERNALCDS', $species_genome_db_id, @member_list);
    #print "EXTERNALCDS members fetched\n";
}    
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
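
For illustration, here is a minimal sketch of one way such a filter could look. It assumes that each member fetched from ENSEMBLGENE responds to get_Gene() (returning the underlying core gene object) and that the gene's biotype() returns 'protein_coding' for protein-coding genes. The name filter_protein_coding_members is hypothetical and is not part of the Ensembl API. Note that keeping only 'protein_coding' also drops pseudogenes and other non-coding biotypes, not just ncRNAs.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only members whose core gene is protein-coding.
# Assumes $member->get_Gene returns a gene object (or undef) and that the
# gene object provides a biotype() method.
sub filter_protein_coding_members {
    my (@members) = @_;
    my @kept;
    foreach my $member (@members) {
        # get_Gene may fail if no core database is available for the
        # species; such members are skipped rather than aborting the run.
        my $gene = eval { $member->get_Gene };
        next unless defined $gene;
        my $biotype = $gene->biotype;
        next unless defined $biotype && $biotype eq 'protein_coding';
        push @kept, $member;
    }
    return @kept;
}
```

In the script above this could be applied to the ENSEMBLGENE members before the family lookup, for example: @member_list = filter_protein_coding_members(@member_list);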


