[ensembl-dev] Filtering ncRNAs from a list of Member objects

Daniel Hughes dsth at ebi.ac.uk
Fri Jul 13 19:31:38 BST 2012


I think perhaps the one you were after, where you do indeed pass a biotype,
is fetch_all_by_biotype from a gene adaptor.
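
If the genes are already in a list, the getter comparison Javier describes below can be applied with grep. This is only a minimal sketch: StandInGene and the GENE_* identifiers are made up for illustration and stand in for real gene objects, so nothing here touches the Ensembl database. For Member objects you would first need the underlying gene object, which this sketch does not cover.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a gene object (NOT the real Ensembl class).
# It exposes only the two read-only accessors this sketch needs.
package StandInGene;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub stable_id { return $_[0]->{stable_id} }
sub biotype   { return $_[0]->{biotype} }

package main;

# Made-up identifiers and biotypes, for illustration only.
my @records = (
    [ 'GENE_A', 'protein_coding' ],
    [ 'GENE_B', 'miRNA' ],
    [ 'GENE_C', 'snoRNA' ],
    [ 'GENE_D', 'protein_coding' ],
);

my @genes;
foreach my $record (@records) {
    push @genes, StandInGene->new(
        stable_id => $record->[0],
        biotype   => $record->[1],
    );
}

# Keep only genes whose biotype (read via the getter) is protein_coding.
my @coding     = grep { $_->biotype() eq 'protein_coding' } @genes;
my @coding_ids = map  { $_->stable_id() } @coding;

print join(',', @coding_ids), "\n";   # prints GENE_A,GENE_D
```

The comparison uses eq on the value returned by the getter, so the objects themselves are never modified.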

Dan
On Jul 13, 2012 7:26 PM, "Javier Herrero" <jherrero at ebi.ac.uk> wrote:

> Hi Chris
>
> You want to use
>
> if ($gene->biotype() eq "protein_coding") {}
>
> instead. With $gene->biotype("protein_coding") you are setting the
> biotype of that gene.
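
To see why the two calls behave differently, here is a minimal sketch. StandInGene is a hypothetical class, not the real Bio::EnsEMBL::Gene; it assumes biotype is a combined getter/setter that stores any argument it is given and returns the current value, which is what the behaviour described above implies.

```perl
use strict;
use warnings;

# Hypothetical stand-in; it only mimics a combined getter/setter accessor.
package StandInGene;

sub new {
    my ($class, %args) = @_;
    return bless { biotype => $args{biotype} }, $class;
}

# With an argument: store it. In every case: return the stored value.
sub biotype {
    my $self = shift;
    $self->{biotype} = shift if @_;
    return $self->{biotype};
}

package main;

my $gene = StandInGene->new(biotype => 'miRNA');

# Getter plus string comparison: false for an miRNA gene.
my $is_coding = ($gene->biotype() eq 'protein_coding') ? 1 : 0;

# Setter call: overwrites the biotype and returns a non-empty (true) string,
# so using it as an "if" condition lets every gene through.
my $returned = $gene->biotype('protein_coding');

print "comparison says coding: $is_coding\n";
print "setter returned: $returned\n";
print "biotype is now: ", $gene->biotype(), "\n";
```

Under that assumption the setter form never filters anything, and it also changes the in-memory object as a side effect.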
>
> Kind regards
>
> Javier
>
> On 13/07/12 18:01, Christopher Kelly wrote:
>
>> Having tried this, I've found that filtering for
>> $gene->biotype("protein_coding"), surprisingly, does not remove ncRNAs;
>> the list I fetched is still packed full of ncRNAs.
>>
>> Could anyone provide additional insight?
>>
>> Cheers,
>> Chris
>> On 2012-07-10, at 11:12 AM, José Afonso Guerra Assunção wrote:
>>
>>> Both genes and transcripts have a method called biotype.
>>> You can select for "protein_coding"...
>>>
>>> HTH,
>>> Jose
>>>
>>> On Tue, Jul 10, 2012 at 7:06 PM, Christopher Kelly <cpjkelly at gmail.com>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am using a script to fetch all Members associated with a given genome
>>>> db id, and fetch the protein family (if any) each member belongs to.
>>>>
>>>> This works fine. However, the comparative analysis program that uses
>>>> the output of this script is producing less reliable results than would be
>>>> desirable, since the human annotation seems to contain many ncRNA genes
>>>> whose orthologues have not yet been identified in the annotations of many
>>>> other species.
>>>>
>>>> In order to improve the accuracy of the analysis program, I would like
>>>> to be able to filter out all ncRNA genes from my script output.
>>>>
>>>> The script usually fetches from ENSEMBLGENE. I have tried fetching from
>>>> ENSEMBLPEP in order to filter out ncRNAs but this still reduces the quality
>>>> of the output for analysis purposes.
>>>>
>>>> Having sifted through a good deal of the Ensembl and Ensembl Compara
>>>> Doxygen documentation, I have yet to find an accurate method that would do
>>>> this for me.
>>>>
>>>> Is there an accurate API function/method for filtering ncRNA-genes from
>>>> a list of member or gene objects?
>>>>
>>>>
>>>> Thanks in advance,
>>>>
>>>> Chris Kelly
>>>>
>>>>
>>>>
>>>> Here is the relevant section of the script code:
>>>>
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>
>>>> $registry->load_registry_from_db(
>>>>     -host => 'ensembldb.ensembl.org',
>>>>     -user => 'anonymous',
>>>>     -port => '5306'); # add -verbose => '1' for more verbose output
>>>>                       # add -db_version => 'version' for a specific
>>>>                       # ensembl db version, otherwise script will search most current version
>>>>
>>>> #get member adaptor
>>>> my $member_adaptor = $registry->get_adaptor('Multi', 'compara', 'Member');
>>>>
>>>> my $family_adaptor = $registry->get_adaptor('Multi', 'compara', 'Family');
>>>>
>>>> my @member_list;
>>>>
>>>> #outside for-loop to iterate body of program for each id specified on
>>>> #the command line
>>>> foreach my $species_genome_db_id (@genome_db_id_list){
>>>>
>>>>     @member_list = ();
>>>>     my $file_path = "$current_directory/$species_genome_db_id";
>>>>     mkdir $file_path, 0777;
>>>>
>>>>     #Fetch all members of given species (specified by genome_db_id) from given source.
>>>>     #Source options are: 'ENSEMBLGENE', 'ENSEMBLPEP', 'Uniprot/SPTREMBL',
>>>>     #'Uniprot/SWISSPROT', 'ENSEMBLTRANS', 'EXTERNALCDS'.
>>>>     #Each species has a unique genome_db_id in the current ensembl compara db version.
>>>>     sub get_members_list {
>>>>
>>>>         my($source, $genome_db_id, @members) = @_;
>>>>
>>>>         #fetch members list - returns listref of members
>>>>         my $new_members_ref = $member_adaptor->fetch_all_by_source_genome_db_id($source, $genome_db_id);
>>>>
>>>>         #dereference members list ref
>>>>         my @new_members = @$new_members_ref;
>>>>
>>>>         #join new_members list to the list of members
>>>>         push(@members, @new_members);
>>>>         @members;
>>>>     }
>>>>
>>>>     #
>>>>     #Get members from all sources for the given genome_db_id (denoting
>>>>     #a specific species)
>>>>     #
>>>>     #@member_list = get_members_list('ENSEMBLGENE', $species_genome_db_id, @member_list);
>>>>     #print "ENSEMBLGENE members fetched\n";
>>>>     @member_list = get_members_list('ENSEMBLPEP', $species_genome_db_id, @member_list);
>>>>     print "ENSEMBLPEP members fetched\n";
>>>>     #@member_list = get_members_list('Uniprot/SPTREMBL', $species_genome_db_id, @member_list);
>>>>     #print "Uniprot/SPTREMBL members fetched\n";
>>>>     #@member_list = get_members_list('Uniprot/SWISSPROT', $species_genome_db_id, @member_list);
>>>>     #print "Uniprot/SWISSPROT members fetched\n";
>>>>     #@member_list = get_members_list('ENSEMBLTRANS', $species_genome_db_id, @member_list);
>>>>     #print "ENSEMBLTRANS members fetched\n";
>>>>     #@member_list = get_members_list('EXTERNALCDS', $species_genome_db_id, @member_list);
>>>>     #print "EXTERNALCDS members fetched\n";
>>>> }
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>
>>
>>
>
>
> --
> Javier Herrero, PhD
> Ensembl Coordinator and Ensembl Compara Project Leader
> European Bioinformatics Institute (EMBL-EBI)
> Wellcome Trust Genome Campus, Hinxton
> Cambridge - CB10 1SD - UK
>