[ensembl-dev] getting exons from database directly

Andrea Edwards edwardsa at cs.man.ac.uk
Mon May 9 16:00:18 BST 2011


Of course :) Dont i feel silly for not spotting that. Thanks a lot!

On 09/05/11 15:57, Jim Stalker wrote:
>
> On 9 May 2011, at 14:15, Andrea Edwards wrote:
>
>> Great, that explains biomart but I still don't understand why the 
>> perl code isn't returning the same as the sql. I'm using sql queries 
>> but was curious about the cause of the difference.
>
> Looks like your Perl code is counting array references (of exons) 
> rather than exons.  You get one array reference per gene, so it's the 
> same as counting genes (hence the same number as biomart, which was 
> also counting genes).
>
> compare:
>
> foreach $exon ($gene->get_all_Exons()) {
>       $exon_count++;
> }
>
> with:
> foreach $exon (@{$gene->get_all_Exons()}) {
>       $exon_count++;
> }
>
>
>> thanks
>>
>> On 09/05/11 13:08, Rhoda Kinsella wrote:
>>>
>>> Hi Andrea,
>>> When you hit the count button, this just gives a count of the number 
>>> of genes as the Ensembl mart datasets you mention are gene centric. 
>>> It will not give a count of the number of exons. To get this count, 
>>> you would have to download the result file and further process the 
>>> file to calculate the number of exons.
>>> Hope that helps
>>> Regards
>>> Rhoda
>>>
>>> On 9 May 2011, at 12:57, Andrea Edwards wrote:
>>>
>>>> Hi
>>>>
>>>> I wasn't using biomart per se; i used it as a test to try and work 
>>>> out why the perl code wasn't returning the same same numbers as the 
>>>> sql
>>>>
>>>> in both cow and human (genes 62) i didn't set any filters and 
>>>> selected exon id, gene id and transcript id attributes. For cow the 
>>>> count was 25670 and for human it was 53334
>>>>
>>>> regards
>>>>
>>>> On 09/05/11 09:05, Rhoda Kinsella wrote:
>>>>>
>>>>> Hi Andrea
>>>>> Can you send me the details of your biomart query so that I can 
>>>>> look into the issue?
>>>>> Regards
>>>>> Rhoda
>>>>>
>>>>> On 6 May 2011, at 17:04, Andrea Edwards wrote:
>>>>>
>>>>>> Hi Bert,
>>>>>>
>>>>>>
>>>>>> I agree with your code and i agree with the sql i originally 
>>>>>> posted. I don't understand why biomart and this perl code (below)
>>>>>> are only returning 25k, and it seems too coincidental they are 
>>>>>> returning the same number
>>>>>>
>>>>>> The difference in results is huge. What exons is this code 
>>>>>> missing? All i could think of was predicted exons but it seems
>>>>>> unlikely there are 25k known exons and (250k-25k = 225k) 
>>>>>> predicted exons not assigned to genes. I don't even know if 
>>>>>> ensembl deals with predicted exons.
>>>>>> I got the same 'discrepancy' with figures when i tested human too.
>>>>>>
>>>>>> ===============================================
>>>>>>
>>>>>> my $gene_adaptor = $registry->get_adaptor( 'bos_taurus', 'Core', 
>>>>>> 'Gene' );
>>>>>> my $genes = $gene_adaptor->fetch_all();
>>>>>>
>>>>>>
>>>>>> $total_genes=0;
>>>>>> $exon_count = 0;
>>>>>> foreach $gene(@{$genes}) {
>>>>>>     $total_genes++;
>>>>>>
>>>>>>     foreach $exon ($gene->get_all_Exons()) {
>>>>>>       $exon_count++;
>>>>>>     }
>>>>>> } #end for each gene
>>>>>>
>>>>>>
>>>>>> =============================================
>>>>>>
>>>>>>
>>>>>> Thank you very much
>>>>>>
>>>>>> On 06/05/11 16:44, Bert Overduin wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> When I use the following code:
>>>>>>>
>>>>>>> #!/usr/bin/perl
>>>>>>>
>>>>>>> use strict;
>>>>>>> use Bio::EnsEMBL::Registry;
>>>>>>>
>>>>>>> my $reg = "Bio::EnsEMBL::Registry";
>>>>>>>
>>>>>>> $reg->load_registry_from_db( -host => 'ensembldb.ensembl.org', 
>>>>>>> -user => 'anonymous' );
>>>>>>>
>>>>>>> my $exon_adaptor  = $reg->get_adaptor( 'Bos taurus', 'Core', 
>>>>>>> 'Exon' );
>>>>>>>
>>>>>>> my $exons = $exon_adaptor->fetch_all;
>>>>>>>
>>>>>>> print scalar( @{$exons} ), "\n";
>>>>>>>
>>>>>>> I get:
>>>>>>>
>>>>>>> farm2-head2[bert]2: perl test.pl
>>>>>>> 225837
>>>>>>>
>>>>>>> Which is the same number I get with a MySQL query:
>>>>>>>
>>>>>>> mysql -u anonymous -h ensembldb.ensembl.org -P 5306
>>>>>>> Welcome to the MySQL monitor.  Commands end with ; or \g.
>>>>>>> Your MySQL connection id is 8610 to server version: 5.1.34-log
>>>>>>>
>>>>>>> Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
>>>>>>>
>>>>>>> mysql> use bos_taurus_core_62_4k
>>>>>>> Reading table information for completion of table and column names
>>>>>>> You can turn off this feature to get a quicker startup with -A
>>>>>>>
>>>>>>> Database changed
>>>>>>> mysql> SELECT COUNT(*) FROM exon;
>>>>>>> +----------+
>>>>>>> | COUNT(*) |
>>>>>>> +----------+
>>>>>>> |   225837 |
>>>>>>> +----------+
>>>>>>> 1 row in set (0.01 sec)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Bert
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 6, 2011 at 4:22 PM, Andrea Edwards 
>>>>>>> <edwardsa at cs.man.ac.uk> wrote:
>>>>>>> I tried 2 ways :
>>>>>>>
>>>>>>> ===============================================
>>>>>>>
>>>>>>> my $gene_adaptor = $registry->get_adaptor( 'bos_taurus', 'Core', 
>>>>>>> 'Gene' );
>>>>>>> my $genes = $gene_adaptor->fetch_all();
>>>>>>>
>>>>>>> my $exon_adaptor = $registry->get_adaptor( 'bos_taurus', 'Core', 
>>>>>>> 'Exon' );
>>>>>>> $total_genes=0;
>>>>>>> $exon_count = 0;
>>>>>>> foreach $gene(@{$genes}) {
>>>>>>>     $total_genes++;
>>>>>>>
>>>>>>>     foreach $exon ($gene->get_all_Exons()) {
>>>>>>>       $exon_count++;
>>>>>>>     }
>>>>>>> } #end for each gene
>>>>>>>
>>>>>>>
>>>>>>> =============================================
>>>>>>>
>>>>>>> This way gave even less (23k) but i'm being stricter here about 
>>>>>>> the chromosomes
>>>>>>>
>>>>>>> @slices = @{ $slice_adaptor->fetch_all('chromosome', undef, 0, 
>>>>>>> 1) };
>>>>>>>
>>>>>>> $total_genes=0;
>>>>>>> $exon_count = 0;
>>>>>>> foreach $slice (@slices) {
>>>>>>>     unless ($slice->seq_region_name() =~ /Un/) {
>>>>>>>         print $slice->seq_region_name."\n";
>>>>>>>         my $genes = $gene_adaptor->fetch_all_by_Slice($slice);
>>>>>>>
>>>>>>>
>>>>>>>         foreach my $gene(@{$genes}) {
>>>>>>>             $total_genes++;
>>>>>>>
>>>>>>>             foreach my $exon ($gene->get_all_Exons()) {
>>>>>>>                   $exon_count++;
>>>>>>>                   print "$exon_count\n";
>>>>>>>             }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>         } #end for each gene
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> ==============================================
>>>>>>>
>>>>>>> But neither give anything like the sql results
>>>>>>>
>>>>>>> Why does the sql give so many more? Which should I use?
>>>>>>>
>>>>>>> thank you
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 06/05/11 15:50, Bert Overduin wrote:
>>>>>>>>
>>>>>>>> Hi Andrea,
>>>>>>>>
>>>>>>>> I suspect that your BioMart results are truncated because the 
>>>>>>>> query is too large.
>>>>>>>>
>>>>>>>> However, that doesn't explain your API results .... How does 
>>>>>>>> your API code look like?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Bert
>>>>>>>>
>>>>>>>> On Fri, May 6, 2011 at 3:45 PM, Andrea Edwards 
>>>>>>>> <edwardsa at cs.man.ac.uk> wrote:
>>>>>>>> Hello
>>>>>>>>
>>>>>>>> I'm sorry for the basic question but I was looking at the 
>>>>>>>> ensembl core schema and trying to retrieve just the exons on 
>>>>>>>> chromosomes and couldn't work out why i am getting such 
>>>>>>>> different figures than with biomart and the perl api
>>>>>>>>
>>>>>>>> For example for cow there are 25670 exons in genes with biomart 
>>>>>>>> and the api but with this sql  ~210k exons. This code is just 
>>>>>>>> looking for exons on chromosomes 1-30 and X
>>>>>>>>
>>>>>>>> select count(distinct stable_id) from exon e inner join 
>>>>>>>> exon_stable_id es using(exon_id) inner join seq_region sr 
>>>>>>>> using(seq_region_id) where sr.coord_system_id = 2 and 
>>>>>>>> sr.nameREGEXP '^[1-9]|^X'  and e.is_current=1
>>>>>>>>
>>>>>>>> I get 8k just on chromosome 1
>>>>>>>>
>>>>>>>> I'm sure this is simple and perhaps its because its Friday 
>>>>>>>> afternoon but I'm just not seeing it!!
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>>>> List admin (including subscribe/unsubscribe): 
>>>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Bert Overduin, Ph.D.
>>>>>>>> Vertebrate Genomics Team
>>>>>>>>
>>>>>>>> EMBL - European Bioinformatics Institute
>>>>>>>> Wellcome Trust Genome Campus
>>>>>>>> Hinxton, Cambridge CB10 1SD
>>>>>>>> United Kingdom
>>>>>>>>
>>>>>>>> http://www.ebi.ac.uk/~bert
>>>>>>>>
>>>>>>>> Ensembl browser: http://www.ensembl.org
>>>>>>>> Mailing lists: 
>>>>>>>> http://www.ensembl.org/info/about/contact/mailing.html
>>>>>>>> Blog: http://www.ensembl.info
>>>>>>>> YouTube: http://www.youtube.com/user/EnsemblHelpdesk
>>>>>>>> Facebook: http://www.facebook.com/Ensembl.org
>>>>>>>> Twitter: http://twitter.com/Ensembl
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Bert Overduin, Ph.D.
>>>>>>> Vertebrate Genomics Team
>>>>>>>
>>>>>>> EMBL - European Bioinformatics Institute
>>>>>>> Wellcome Trust Genome Campus
>>>>>>> Hinxton, Cambridge CB10 1SD
>>>>>>> United Kingdom
>>>>>>>
>>>>>>> http://www.ebi.ac.uk/~bert
>>>>>>>
>>>>>>> Ensembl browser: http://www.ensembl.org
>>>>>>> Mailing lists: 
>>>>>>> http://www.ensembl.org/info/about/contact/mailing.html
>>>>>>> Blog: http://www.ensembl.info
>>>>>>> YouTube: http://www.youtube.com/user/EnsemblHelpdesk
>>>>>>> Facebook: http://www.facebook.com/Ensembl.org
>>>>>>> Twitter: http://twitter.com/Ensembl
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>> List admin (including subscribe/unsubscribe): 
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>
>>>>> Rhoda Kinsella Ph.D.
>>>>> Ensembl Bioinformatician,
>>>>> European Bioinformatics Institute (EMBL-EBI),
>>>>> Wellcome Trust Genome Campus,
>>>>> Hinxton
>>>>> Cambridge CB10 1SD,
>>>>> UK.
>>>>>
>>>>
>>>
>>> Rhoda Kinsella Ph.D.
>>> Ensembl Bioinformatician,
>>> European Bioinformatics Institute (EMBL-EBI),
>>> Wellcome Trust Genome Campus,
>>> Hinxton
>>> Cambridge CB10 1SD,
>>> UK.
>>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> List admin (including subscribe/unsubscribe): 
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>





More information about the Dev mailing list