[ensembl-dev] Bug in Ensembl BioMart - missing genes if some attributes are selected

Rhoda Kinsella rhoda at ebi.ac.uk
Mon Oct 28 13:42:07 GMT 2013


Hi Thomas
Thanks for your email. Unfortunately, this is a well-known issue caused by a limitation in the BioMart software that we use to build the mart databases. Below is a detailed description of what is going on behind the scenes, which was sent to a user who had a similar query to yours:


> In the Ensembl Mart, the main dataset is made up of three main tables,
> each with a number of associated dimension (dm) tables.
> 
> The first main table is built from the gene table in the Ensembl schema.
> With it goes all the information directly associated with the gene
> table, such as cross-references assigned directly to genes (as opposed
> to transcripts or translations).
> 
> The second main table inherits its fields from the gene main table
> and adds the transcript-related data, cross-references specifically
> on transcripts etc.  When building the transcript main table, it also
> inherits the data from the gene main table.  More specifically, the
> transcript main table contains the data from the gene table for all
> genes that have transcripts, i.e. all genes.
> 
> Depending on what the user asks for, the gene or the transcript main
> table will be used.  Asking for HGNC symbols and transcript stable IDs,
> for example, will use the transcript main table since the transcript
> stable IDs are not available in the gene main table (the HGNC symbols
> will be joined in from the cross-reference dimension table hanging
> off the gene main table using the gene key which is available in the
> transcript main table).
> 
> The third main table inherits its structure from the second main table,
> which means that it contains all the fields from the gene main table and
> the transcript main table, and then it adds the specific fields for the
> translations, cross-references specifically on translations etc.  It
> contains all the data from the transcript main table, but, and this is
> important, only for the transcripts that have translations, i.e. *not*
> all transcripts.
> 
> When you ask for the Swissprot ID (or any other external reference
> mapped to translations, e.g. GO ID, GOSlim ID, EMBL ID or HPA ID in
> human), which is a cross-reference associated with translations, the
> main table that will be involved must be the translation main table.
> Since the translation main table only contains data for the transcripts
> (and genes) that have translations, filtering in such a way that only
> non-coding genes are considered will ensure that no data is returned.
> This is what happens when you, for example, filter for genes overlapping
> a particular probe, and the only genes that do so are non-coding genes.
> 
> This is the way that the MartBuilder tool works when inheriting table
> structures for multiple main tables in a dataset, and it has nothing
> to do with joining (that we can influence).  It is a deficiency in the
> BioMart software that we are aware of and you are not the first user to
> stumble upon it.  We will obviously make sure that the BioMart team in
> Toronto are made aware of this (again).
> 
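
As a rough illustration of the inheritance described above, here is a small Python sketch that models the three main tables with toy data. Everything in it (the IDs G1, T1, P1, the placeholder GO term, and the `query` helper) is invented for illustration; it is not the real mart schema or MartBuilder code, only a model of why rows vanish once a translation-level attribute is requested.

```python
# Toy data, invented for illustration (not real Ensembl records).
GENES = ["G1", "G2"]                     # G1 is coding, G2 is non-coding
TRANSCRIPTS = {"T1": "G1", "T2": "G2"}   # transcript -> gene
TRANSLATIONS = {"P1": "T1"}              # translation -> transcript (T2 has none)
GO_DM = {"P1": ["GO:example"]}           # dimension table keyed on translation

# Each main table inherits the rows of the one before it.
GENE_MAIN = [{"gene": g} for g in GENES]
TRANSCRIPT_MAIN = [{"gene": g, "transcript": t} for t, g in TRANSCRIPTS.items()]
_protein_of = {t: p for p, t in TRANSLATIONS.items()}
TRANSLATION_MAIN = [
    {**row, "translation": _protein_of[row["transcript"]]}
    for row in TRANSCRIPT_MAIN
    if row["transcript"] in _protein_of  # transcripts without a translation are dropped
]

LEVEL_NAMES = ["gene", "transcript", "translation"]
MAIN_TABLES = dict(zip(LEVEL_NAMES, [GENE_MAIN, TRANSCRIPT_MAIN, TRANSLATION_MAIN]))
LEVEL_OF = {"gene": 0, "transcript": 1, "go_id": 2}  # go_id hangs off translations


def query(attributes, gene_id):
    """Return result rows for one gene, using the deepest main table required."""
    level = max(LEVEL_OF[a] for a in attributes)
    main = MAIN_TABLES[LEVEL_NAMES[level]]
    rows = []
    for row in main:
        if row["gene"] != gene_id:
            continue
        # The dimension table is joined so that a missing term gives an empty field.
        go_terms = GO_DM.get(row["translation"], [""]) if "go_id" in attributes else [None]
        for go in go_terms:
            rows.append(tuple(go if a == "go_id" else row[a] for a in attributes))
    return rows


print(query(["gene", "transcript"], "G2"))           # served by the transcript main table
print(query(["gene", "transcript", "go_id"], "G2"))  # served by the translation main table
```

In this model the non-coding gene G2 is returned by the first query but not by the second, because it never made it into the translation main table; an empty GO column is only possible for rows that exist there.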



I hope this explains the issue, but do email again if you have further questions. We will contact the BioMart team again and see if they have a plan to fix this issue.
Kind regards
Rhoda



On 25 Oct 2013, at 22:09, "Thomas Danhorn" <danhornt at njhealth.org> wrote:

> There is a bug in Ensembl BioMart that omits certain genes from the results when certain attributes are chosen.  This affects all versions of Ensembl I looked at, both the current one (73) and the archive (tested 65 and 67), for at least mouse and human (did not check other organisms).
> 
> The effect is severe: tens of thousands of genes will not show up in the results, but there is no indication of this until the output is examined - the "Count" button will always show the expected number of genes.  Note that I am not complaining about empty fields for certain attributes, but about the disappearance of whole lines (i.e. genes/transcripts) when (and only when) those attributes are chosen.
> 
> The affected attributes include all of the GO and GOSlim GOA fields, several of the "External References" (e.g. EMBL (Genbank) ID and Human Protein Atlas Antibody ID, but not the various Ensembl and VEGA fields, PUBMED ID, or UCSC ID), and most of the "PROTEIN DOMAINS" information (but not the Ensembl fields).  None of the "GENE", "EXPRESSION", and "Microarray Probes" fields I looked at seem to cause this problem.
> 
> Steps to reproduce:
> - Go to the Ensembl BioMart website (current or archive).  Choose "Ensembl Genes [version]" and an organism (mouse, human).
> - Click on "Filters".  Check "ID List Limit" and enter ENSG00000002079 (for human) or ENSMUSG00000000031 (for mouse) in the box (with "Ensembl Gene ID(s)" selected).  Note that this filtering step has nothing to do with the problem per se and can be left out, but it reduces the results to 0 vs 1 as opposed to e.g. 24k vs 55k.  The gene IDs I list here are among the thousands that show this behavior; if you add others, those might or might not.
> - Click the "Count" button - it will show 1 gene.
> - Click "Results" - you will see a line for each transcript (default attributes are Ensembl gene and transcript IDs), as expected.
> - Now click "Attributes" and check "GO Term Accession" (or any of the other problematic ones listed above).
> - Click "Count" again - this will still show 1 gene.
> - Click "Results", and you will see a header line with your selected attributes (like before), but no lines for the gene/transcripts.  If you download the results, the file will only contain the header.  I would expect to see here the same output (one line per transcript) as above, with an extra (perhaps empty) column for the "GO Term Accession" (or whatever other problematic attribute you selected).
> 
> I have observed this behavior over several days/weeks, so I don't think this is a momentary server glitch.  I doubt that it makes a difference, but I am using Firefox 20.0 on Linux.  Please let me know if you need additional information (more problematic gene IDs, etc.).  I could not check if this has been reported before (due to the server problems the mail archive is still down), but I hope this information is helpful.
> 
> Thanks,
> 
> Thomas
> 
> 
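
The reproduction steps above can also be written as a pair of BioMart XML queries that differ only in one attribute. The sketch below just builds the two XML documents; it does not contact any server. The dataset name `hsapiens_gene_ensembl` and the internal names `ensembl_gene_id`, `ensembl_transcript_id` and `go_id` are assumptions about how the web interface labels map to mart names in this release, and `build_query` is a hypothetical helper, so check them against the mart's own XML output before relying on them.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, gene_id, attributes):
    """Build a BioMart-style XML query string (hypothetical helper, no network I/O)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Restrict the query to a single gene, as in the "ID List Limit" step.
    ET.SubElement(ds, "Filter", {"name": "ensembl_gene_id", "value": gene_id})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


DATASET = "hsapiens_gene_ensembl"   # assumed dataset name for human
GENE = "ENSG00000002079"            # gene ID taken from the report above

# Default attributes only: one line per transcript is expected.
baseline = build_query(DATASET, GENE, ["ensembl_gene_id", "ensembl_transcript_id"])
# Same query plus a translation-level attribute: the report says only a header comes back.
with_go = build_query(DATASET, GENE, ["ensembl_gene_id", "ensembl_transcript_id", "go_id"])

print(baseline)
print(with_go)
```

Submitting each string as the `query` parameter of the mart's martservice endpoint and comparing the number of returned lines would show the discrepancy without going through the browser.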

Rhoda Kinsella Ph.D.
Ensembl Production Project Leader
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton,
Cambridge
CB10 1SD



