[ensembl-dev] getting exons from database directly

Andrea Edwards edwardsa at cs.man.ac.uk
Mon May 9 14:15:40 BST 2011


Great, that explains biomart but I still don't understand why the perl 
code isn't returning the same as the sql. I'm using sql queries but was 
curious about the cause of the difference.
thanks

On 09/05/11 13:08, Rhoda Kinsella wrote:
> Hi Andrea,
> When you hit the count button, this just gives a count of the number 
> of genes as the Ensembl mart datasets you mention are gene centric. It 
> will not give a count of the number of exons. To get this count, you 
> would have to download the result file and further process the file to 
> calculate the number of exons.
> Hope that helps
> Regards
> Rhoda
>
> On 9 May 2011, at 12:57, Andrea Edwards wrote:
>
>> Hi
>>
>> I wasn't using biomart per se; i used it as a test to try and work 
>> out why the perl code wasn't returning the same same numbers as the sql
>>
>> in both cow and human (genes 62) i didn't set any filters and 
>> selected exon id, gene id and transcript id attributes. For cow the 
>> count was 25670 and for human it was 53334
>>
>> regards
>>
>> On 09/05/11 09:05, Rhoda Kinsella wrote:
>>> Hi Andrea
>>> Can you send me the details of your biomart query so that I can look 
>>> into the issue?
>>> Regards
>>> Rhoda
>>>
>>> On 6 May 2011, at 17:04, Andrea Edwards wrote:
>>>
>>>> Hi Bert,
>>>>
>>>>
>>>> I agree with your code and i agree with the sql i originally 
>>>> posted. I don't understand why biomart and this perl code (below)
>>>> are only returning 25k, and it seems too coincidental they are 
>>>> returning the same number
>>>>
>>>> The difference in results is huge. What exons is this code missing? 
>>>> All i could think of was predicted exons but it seems
>>>> unlikely there are 25k known exons and (250k-25k = 225k) predicted 
>>>> exons not assigned to genes. I don't even know if ensembl deals 
>>>> with predicted exons.
>>>> I got the same 'discrepancy' with figures when i tested human too.
>>>>
>>>> ===============================================
>>>>
>>>> my $gene_adaptor = $registry->get_adaptor( 'bos_taurus', 'Core', 
>>>> 'Gene' );
>>>> my $genes = $gene_adaptor->fetch_all();
>>>>
>>>>
>>>> $total_genes=0;
>>>> $exon_count = 0;
>>>> foreach $gene(@{$genes}) {
>>>>     $total_genes++;
>>>>
>>>>     foreach $exon ($gene->get_all_Exons()) {
>>>>       $exon_count++;
>>>>     }
>>>> } #end for each gene
>>>>
>>>>
>>>> =============================================
>>>>
>>>>
>>>> Thank you very much
>>>>
>>>> On 06/05/11 16:44, Bert Overduin wrote:
>>>>> Hi,
>>>>>
>>>>> When I use the following code:
>>>>>
>>>>> #!/usr/bin/perl
>>>>>
>>>>> use strict;
>>>>> use Bio::EnsEMBL::Registry;
>>>>>
>>>>> my $reg = "Bio::EnsEMBL::Registry";
>>>>>
>>>>> $reg->load_registry_from_db( -host => 'ensembldb.ensembl.org 
>>>>> <http://ensembldb.ensembl.org>', -user => 'anonymous' );
>>>>>
>>>>> my $exon_adaptor  = $reg->get_adaptor( 'Bos taurus', 'Core', 'Exon' );
>>>>>
>>>>> my $exons = $exon_adaptor->fetch_all;
>>>>>
>>>>> print scalar( @{$exons} ), "\n";
>>>>>
>>>>> I get:
>>>>>
>>>>> farm2-head2[bert]2: perltest.pl <http://test.pl>
>>>>> 225837
>>>>>
>>>>> Which is the same number I get with a MySQL query:
>>>>>
>>>>> mysql -u anonymous -hensembldb.ensembl.org 
>>>>> <http://ensembldb.ensembl.org>-P 5306
>>>>> Welcome to the MySQL monitor.  Commands end with ; or \g.
>>>>> Your MySQL connection id is 8610 to server version: 5.1.34-log
>>>>>
>>>>> Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
>>>>>
>>>>> mysql> use bos_taurus_core_62_4k
>>>>> Reading table information for completion of table and column names
>>>>> You can turn off this feature to get a quicker startup with -A
>>>>>
>>>>> Database changed
>>>>> mysql> SELECT COUNT(*) FROM exon;
>>>>> +----------+
>>>>> | COUNT(*) |
>>>>> +----------+
>>>>> |   225837 |
>>>>> +----------+
>>>>> 1 row in set (0.01 sec)
>>>>>
>>>>> Cheers,
>>>>> Bert
>>>>>
>>>>>
>>>>> On Fri, May 6, 2011 at 4:22 PM, Andrea 
>>>>> Edwards<edwardsa at cs.man.ac.uk <mailto:edwardsa at cs.man.ac.uk>>wrote:
>>>>>
>>>>>     I tried 2 ways :
>>>>>
>>>>>     ===============================================
>>>>>
>>>>>     my $gene_adaptor = $registry->get_adaptor( 'bos_taurus',
>>>>>     'Core', 'Gene' );
>>>>>     my $genes = $gene_adaptor->fetch_all();
>>>>>
>>>>>     my $exon_adaptor = $registry->get_adaptor( 'bos_taurus',
>>>>>     'Core', 'Exon' );
>>>>>     $total_genes=0;
>>>>>     $exon_count = 0;
>>>>>     foreach $gene(@{$genes}) {
>>>>>         $total_genes++;
>>>>>
>>>>>         foreach $exon ($gene->get_all_Exons()) {
>>>>>           $exon_count++;
>>>>>         }
>>>>>     } #end for each gene
>>>>>
>>>>>
>>>>>     =============================================
>>>>>
>>>>>     This way gave even less (23k) but i'm being stricter here
>>>>>     about the chromosomes
>>>>>
>>>>>     @slices = @{ $slice_adaptor->fetch_all('chromosome', undef, 0,
>>>>>     1) };
>>>>>
>>>>>     $total_genes=0;
>>>>>     $exon_count = 0;
>>>>>     foreach $slice (@slices) {
>>>>>         unless ($slice->seq_region_name() =~ /Un/) {
>>>>>             print $slice->seq_region_name."\n";
>>>>>             my $genes = $gene_adaptor->fetch_all_by_Slice($slice);
>>>>>
>>>>>
>>>>>             foreach my $gene(@{$genes}) {
>>>>>                 $total_genes++;
>>>>>
>>>>>                 foreach my $exon ($gene->get_all_Exons()) {
>>>>>                       $exon_count++;
>>>>>                       print "$exon_count\n";
>>>>>                 }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>             } #end for each gene
>>>>>         }
>>>>>     }
>>>>>
>>>>>     ==============================================
>>>>>
>>>>>     But neither give anything like the sql results
>>>>>
>>>>>     Why does the sql give so many more? Which should I use?
>>>>>
>>>>>     thank you
>>>>>
>>>>>
>>>>>
>>>>>     On 06/05/11 15:50, Bert Overduin wrote:
>>>>>>     Hi Andrea,
>>>>>>
>>>>>>     I suspect that your BioMart results are truncated because the
>>>>>>     query is too large.
>>>>>>
>>>>>>     However, that doesn't explain your API results .... How does
>>>>>>     your API code look like?
>>>>>>
>>>>>>     Cheers,
>>>>>>     Bert
>>>>>>
>>>>>>     On Fri, May 6, 2011 at 3:45 PM, Andrea
>>>>>>     Edwards<edwardsa at cs.man.ac.uk
>>>>>>     <mailto:edwardsa at cs.man.ac.uk>>wrote:
>>>>>>
>>>>>>         Hello
>>>>>>
>>>>>>         I'm sorry for the basic question but I was looking at the
>>>>>>         ensembl core schema and trying to retrieve just the exons
>>>>>>         on chromosomes and couldn't work out why i am getting
>>>>>>         such different figures than with biomart and the perl api
>>>>>>
>>>>>>         For example for cow there are 25670 exons in genes with
>>>>>>         biomart and the api but with this sql  ~210k exons. This
>>>>>>         code is just looking for exons on chromosomes 1-30 and X
>>>>>>
>>>>>>         select count(distinct stable_id) from exon e inner join
>>>>>>         exon_stable_id es using(exon_id) inner join seq_region sr
>>>>>>         using(seq_region_id) where sr.coord_system_id = 2
>>>>>>         andsr.name <http://sr.name>REGEXP '^[1-9]|^X'  and
>>>>>>         e.is_current=1
>>>>>>
>>>>>>         I get 8k just on chromosome 1
>>>>>>
>>>>>>         I'm sure this is simple and perhaps its because its
>>>>>>         Friday afternoon but I'm just not seeing it!!
>>>>>>
>>>>>>         _______________________________________________
>>>>>>         Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>>>>         List admin (including
>>>>>>         subscribe/unsubscribe):http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>         Ensembl Blog:http://www.ensembl.info/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>     --
>>>>>>     Bert Overduin, Ph.D.
>>>>>>     Vertebrate Genomics Team
>>>>>>
>>>>>>     EMBL - European Bioinformatics Institute
>>>>>>     Wellcome Trust Genome Campus
>>>>>>     Hinxton, Cambridge CB10 1SD
>>>>>>     United Kingdom
>>>>>>
>>>>>>     http://www.ebi.ac.uk/~bert <http://www.ebi.ac.uk/%7Ebert>
>>>>>>
>>>>>>     Ensembl browser: http://www.ensembl.org <http://www.ensembl.org/>
>>>>>>     Mailing lists:
>>>>>>     http://www.ensembl.org/info/about/contact/mailing.html
>>>>>>     Blog: http://www.ensembl.info <http://www.ensembl.info/>
>>>>>>     YouTube: http://www.youtube.com/user/EnsemblHelpdesk
>>>>>>     Facebook: http://www.facebook.com/Ensembl.org
>>>>>>     Twitter: http://twitter.com/Ensembl
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Bert Overduin, Ph.D.
>>>>> Vertebrate Genomics Team
>>>>>
>>>>> EMBL - European Bioinformatics Institute
>>>>> Wellcome Trust Genome Campus
>>>>> Hinxton, Cambridge CB10 1SD
>>>>> United Kingdom
>>>>>
>>>>> http://www.ebi.ac.uk/~bert <http://www.ebi.ac.uk/%7Ebert>
>>>>>
>>>>> Ensembl browser: http://www.ensembl.org <http://www.ensembl.org/>
>>>>> Mailing lists: http://www.ensembl.org/info/about/contact/mailing.html
>>>>> Blog: http://www.ensembl.info <http://www.ensembl.info/>
>>>>> YouTube: http://www.youtube.com/user/EnsemblHelpdesk
>>>>> Facebook: http://www.facebook.com/Ensembl.org
>>>>> Twitter: http://twitter.com/Ensembl
>>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>> List admin (including 
>>>> subscribe/unsubscribe):http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog:http://www.ensembl.info/
>>>
>>> Rhoda Kinsella Ph.D.
>>> Ensembl Bioinformatician,
>>> European Bioinformatics Institute (EMBL-EBI),
>>> Wellcome Trust Genome Campus,
>>> Hinxton
>>> Cambridge CB10 1SD,
>>> UK.
>>>
>>
>
> Rhoda Kinsella Ph.D.
> Ensembl Bioinformatician,
> European Bioinformatics Institute (EMBL-EBI),
> Wellcome Trust Genome Campus,
> Hinxton
> Cambridge CB10 1SD,
> UK.
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20110509/9113f483/attachment.html>


More information about the Dev mailing list