[ensembl-dev] different figures between ensembl and biomart

Andreas Kahari ak at ebi.ac.uk
Fri Nov 19 10:40:41 GMT 2010


mysql> select count(1) from exon;
+----------+
| count(1) |
+----------+
|   225838 | 
+----------+
1 row in set (0.03 sec)

mysql> select count(1) from exon_transcript;
+----------+
| count(1) |
+----------+
|   257029 | 
+----------+
1 row in set (0.03 sec)


What's happening here is that we allow exons to be shared between
transcripts.

The exon table holds the individual exons, and the exon_transcript
table holds the many-to-many mapping between exons and transcripts.

If you ask the API for all exons, you will get the distinct exons
(225838), but if you go through the transcripts (for each transcript,
get its exons and count them), you will get the "denormalized" count
of exons (257029), in which an exon shared by N transcripts is counted
N times.  BioMart pulls in the exons through their transcripts, and
that's why you get a higher number of exons that way.
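
As a rough illustration of the two ways of counting, here is a minimal
sketch in Python using an in-memory SQLite database.  The two-column
tables and the toy rows are assumptions made up for this example (the
real Ensembl tables have more columns and live in MySQL); only the
table names exon and exon_transcript come from the queries above.

```python
import sqlite3

# Toy data, NOT real Ensembl content: three exons and two transcripts,
# with exon 2 shared by both transcripts.  The schema is a simplified,
# hypothetical stand-in for the exon and exon_transcript tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exon (exon_id INTEGER PRIMARY KEY);
    CREATE TABLE exon_transcript (
        exon_id       INTEGER NOT NULL,
        transcript_id INTEGER NOT NULL,
        PRIMARY KEY (exon_id, transcript_id)
    );
    INSERT INTO exon (exon_id) VALUES (1), (2), (3);
    INSERT INTO exon_transcript (exon_id, transcript_id) VALUES
        (1, 10), (2, 10),
        (2, 20), (3, 20);
""")


def count(sql):
    """Run a query that returns a single number and return that number."""
    return conn.execute(sql).fetchone()[0]


# Distinct exons: one row per exon, however many transcripts use it.
distinct_exons = count("SELECT COUNT(*) FROM exon")

# "Denormalized" count: one row per (exon, transcript) pairing, which is
# what you get by walking every transcript and counting its exons.
exon_transcript_pairs = count("SELECT COUNT(*) FROM exon_transcript")

# Collapsing the mapping back to unique exon IDs recovers the first count.
distinct_via_mapping = count(
    "SELECT COUNT(DISTINCT exon_id) FROM exon_transcript"
)

# Exons that appear in more than one transcript account for the gap.
shared_exons = count("""
    SELECT COUNT(*) FROM (
        SELECT exon_id FROM exon_transcript
        GROUP BY exon_id HAVING COUNT(*) > 1
    )
""")

print("distinct exons:       ", distinct_exons)
print("exon-transcript pairs:", exon_transcript_pairs)
print("distinct via mapping: ", distinct_via_mapping)
print("shared exons:         ", shared_exons)
```

With this toy data there are 3 distinct exons but 4 exon-transcript
pairs, because the one shared exon is counted once per transcript.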

Which is the right number?  That depends on your definition: 225838
if you want distinct exons, 257029 if you want every exon-transcript
pairing.

Andreas


PS. I did not look too carefully at your code, so I won't comment on it.

On Thu, Nov 18, 2010 at 05:24:04PM +0000, Andrea Edwards wrote:
> Hi
> 
> I have some code (see below) to get all of the exons in Ensembl for
> the cow database, release 58. I got 225838 using this method. However,
> a colleague of mine accessed all of the cow exons using BioMart and
> obtained 257029.
> 
> At present I don't have access to the code that accessed BioMart, but:
> 
> 1) Is it possible to ask someone at Ensembl for a definitive count
> of the number of exons in cow 58?
> 2) Can anyone see anything obvious in my code below that might make
> me miss any exons? I simply get all genes and all their transcripts
> and all their exons. This includes all gene biotypes, including
> protein coding and RNA genes. My count without including RNA genes
> was 220 000 ish. I can't imagine where an extra 25000 exons can come
> from unless (and I know this is speculation) the BioMart script has
> duplicates for an exon when it appears in multiple transcripts
> whereas I only get unique exons.
> 
> thanks in advance
> 
> 
> #!/usr/local/bin/perl
[cut]

-- 
Andreas Kähäri, Ensembl Software Developer
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD, United Kingdom



