[ensembl-dev] getting exons from database directly

Rhoda Kinsella rhoda at ebi.ac.uk
Mon May 9 13:08:43 BST 2011


Hi Andrea,
When you hit the count button, this just gives a count of the number  
of genes as the Ensembl mart datasets you mention are gene centric. It  
will not give a count of the number of exons. To get this count, you  
would have to download the result file and further process the file to  
calculate the number of exons.
Hope that helps
Regards
Rhoda

On 9 May 2011, at 12:57, Andrea Edwards wrote:

> Hi
>
> I wasn't using biomart per se; i used it as a test to try and work  
> out why the perl code wasn't returning the same same numbers as the  
> sql
>
> in both cow and human (genes 62) i didn't set any filters and  
> selected exon id, gene id and transcript id attributes. For cow the  
> count was 25670 and for human it was 53334
>
> regards
>
> On 09/05/11 09:05, Rhoda Kinsella wrote:
>>
>> Hi Andrea
>> Can you send me the details of your biomart query so that I can  
>> look into the issue?
>> Regards
>> Rhoda
>>
>> On 6 May 2011, at 17:04, Andrea Edwards wrote:
>>
>>> Hi Bert,
>>>
>>>
>>> I agree with your code and i agree with the sql i originally  
>>> posted. I don't understand why biomart and this perl code (below)
>>> are only returning 25k, and it seems too coincidental they are  
>>> returning the same number
>>>
>>> The difference in results is huge. What exons is this code  
>>> missing? All i could think of was predicted exons but it seems
>>> unlikely there are 25k known exons and (250k-25k = 225k) predicted  
>>> exons not assigned to genes. I don't even know if ensembl deals  
>>> with predicted exons.
>>> I got the same 'discrepancy' with figures when i tested human too.
>>>
>>> ===============================================
>>>
>>> my $gene_adaptor = $registry->get_adaptor( 'bos_taurus', 'Core',  
>>> 'Gene' );
>>> my $genes = $gene_adaptor->fetch_all();
>>>
>>>
>>> $total_genes=0;
>>> $exon_count = 0;
>>> foreach $gene(@{$genes}) {
>>>     $total_genes++;
>>>
>>>     foreach $exon ($gene->get_all_Exons()) {
>>>       $exon_count++;
>>>     }
>>> } #end for each gene
>>>
>>>
>>> =============================================
>>>
>>>
>>> Thank you very much
>>>
>>> On 06/05/11 16:44, Bert Overduin wrote:
>>>>
>>>> Hi,
>>>>
>>>> When I use the following code:
>>>>
>>>> #!/usr/bin/perl
>>>>
>>>> use strict;
>>>> use Bio::EnsEMBL::Registry;
>>>>
>>>> my $reg = "Bio::EnsEMBL::Registry";
>>>>
>>>> $reg->load_registry_from_db( -host => 'ensembldb.ensembl.org', - 
>>>> user => 'anonymous' );
>>>>
>>>> my $exon_adaptor  = $reg->get_adaptor( 'Bos taurus', 'Core',  
>>>> 'Exon' );
>>>>
>>>> my $exons = $exon_adaptor->fetch_all;
>>>>
>>>> print scalar( @{$exons} ), "\n";
>>>>
>>>> I get:
>>>>
>>>> farm2-head2[bert]2: perl test.pl
>>>> 225837
>>>>
>>>> Which is the same number I get with a MySQL query:
>>>>
>>>> mysql -u anonymous -h ensembldb.ensembl.org -P 5306
>>>> Welcome to the MySQL monitor.  Commands end with ; or \g.
>>>> Your MySQL connection id is 8610 to server version: 5.1.34-log
>>>>
>>>> Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
>>>>
>>>> mysql> use bos_taurus_core_62_4k
>>>> Reading table information for completion of table and column names
>>>> You can turn off this feature to get a quicker startup with -A
>>>>
>>>> Database changed
>>>> mysql> SELECT COUNT(*) FROM exon;
>>>> +----------+
>>>> | COUNT(*) |
>>>> +----------+
>>>> |   225837 |
>>>> +----------+
>>>> 1 row in set (0.01 sec)
>>>>
>>>> Cheers,
>>>> Bert
>>>>
>>>>
>>>> On Fri, May 6, 2011 at 4:22 PM, Andrea Edwards <edwardsa at cs.man.ac.uk 
>>>> > wrote:
>>>> I tried 2 ways :
>>>>
>>>> ===============================================
>>>>
>>>> my $gene_adaptor = $registry->get_adaptor( 'bos_taurus', 'Core',  
>>>> 'Gene' );
>>>> my $genes = $gene_adaptor->fetch_all();
>>>>
>>>> my $exon_adaptor = $registry->get_adaptor( 'bos_taurus', 'Core',  
>>>> 'Exon' );
>>>> $total_genes=0;
>>>> $exon_count = 0;
>>>> foreach $gene(@{$genes}) {
>>>>     $total_genes++;
>>>>
>>>>     foreach $exon ($gene->get_all_Exons()) {
>>>>       $exon_count++;
>>>>     }
>>>> } #end for each gene
>>>>
>>>>
>>>> =============================================
>>>>
>>>> This way gave even less (23k) but i'm being stricter here about  
>>>> the chromosomes
>>>>
>>>> @slices = @{ $slice_adaptor->fetch_all('chromosome', undef, 0,  
>>>> 1) };
>>>>
>>>> $total_genes=0;
>>>> $exon_count = 0;
>>>> foreach $slice (@slices) {
>>>>     unless ($slice->seq_region_name() =~ /Un/) {
>>>>         print $slice->seq_region_name."\n";
>>>>         my $genes = $gene_adaptor->fetch_all_by_Slice($slice);
>>>>
>>>>
>>>>         foreach my $gene(@{$genes}) {
>>>>             $total_genes++;
>>>>
>>>>             foreach my $exon ($gene->get_all_Exons()) {
>>>>                   $exon_count++;
>>>>                   print "$exon_count\n";
>>>>             }
>>>>
>>>>
>>>>
>>>>
>>>>         } #end for each gene
>>>>     }
>>>> }
>>>>
>>>> ==============================================
>>>>
>>>> But neither give anything like the sql results
>>>>
>>>> Why does the sql give so many more? Which should I use?
>>>>
>>>> thank you
>>>>
>>>>
>>>>
>>>> On 06/05/11 15:50, Bert Overduin wrote:
>>>>>
>>>>> Hi Andrea,
>>>>>
>>>>> I suspect that your BioMart results are truncated because the  
>>>>> query is too large.
>>>>>
>>>>> However, that doesn't explain your API results .... How does  
>>>>> your API code look like?
>>>>>
>>>>> Cheers,
>>>>> Bert
>>>>>
>>>>> On Fri, May 6, 2011 at 3:45 PM, Andrea Edwards <edwardsa at cs.man.ac.uk 
>>>>> > wrote:
>>>>> Hello
>>>>>
>>>>> I'm sorry for the basic question but I was looking at the  
>>>>> ensembl core schema and trying to retrieve just the exons on  
>>>>> chromosomes and couldn't work out why i am getting such  
>>>>> different figures than with biomart and the perl api
>>>>>
>>>>> For example for cow there are 25670 exons in genes with biomart  
>>>>> and the api but with this sql  ~210k exons. This code is just  
>>>>> looking for exons on chromosomes 1-30 and X
>>>>>
>>>>> select count(distinct stable_id) from exon e inner join  
>>>>> exon_stable_id es using(exon_id) inner join seq_region sr  
>>>>> using(seq_region_id) where sr.coord_system_id = 2 and  
>>>>> sr.nameREGEXP '^[1-9]|^X'  and e.is_current=1
>>>>>
>>>>> I get 8k just on chromosome 1
>>>>>
>>>>> I'm sure this is simple and perhaps its because its Friday  
>>>>> afternoon but I'm just not seeing it!!
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list    Dev at ensembl.org
>>>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Bert Overduin, Ph.D.
>>>>> Vertebrate Genomics Team
>>>>>
>>>>> EMBL - European Bioinformatics Institute
>>>>> Wellcome Trust Genome Campus
>>>>> Hinxton, Cambridge CB10 1SD
>>>>> United Kingdom
>>>>>
>>>>> http://www.ebi.ac.uk/~bert
>>>>>
>>>>> Ensembl browser: http://www.ensembl.org
>>>>> Mailing lists: http://www.ensembl.org/info/about/contact/mailing.html
>>>>> Blog: http://www.ensembl.info
>>>>> YouTube: http://www.youtube.com/user/EnsemblHelpdesk
>>>>> Facebook: http://www.facebook.com/Ensembl.org
>>>>> Twitter: http://twitter.com/Ensembl
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> Bert Overduin, Ph.D.
>>>> Vertebrate Genomics Team
>>>>
>>>> EMBL - European Bioinformatics Institute
>>>> Wellcome Trust Genome Campus
>>>> Hinxton, Cambridge CB10 1SD
>>>> United Kingdom
>>>>
>>>> http://www.ebi.ac.uk/~bert
>>>>
>>>> Ensembl browser: http://www.ensembl.org
>>>> Mailing lists: http://www.ensembl.org/info/about/contact/mailing.html
>>>> Blog: http://www.ensembl.info
>>>> YouTube: http://www.youtube.com/user/EnsemblHelpdesk
>>>> Facebook: http://www.facebook.com/Ensembl.org
>>>> Twitter: http://twitter.com/Ensembl
>>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>> Rhoda Kinsella Ph.D.
>> Ensembl Bioinformatician,
>> European Bioinformatics Institute (EMBL-EBI),
>> Wellcome Trust Genome Campus,
>> Hinxton
>> Cambridge CB10 1SD,
>> UK.
>>
>

Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20110509/0cc1f05b/attachment.html>


More information about the Dev mailing list