[ensembl-dev] Bug in Ensembl BioMart - missing genes if some attributes are selected

Mon Oct 28 16:01:35 GMT 2013

Hi Rhoda,

Thank you very much for the thorough explanation.  While the BioMart 
people are working on this (I hope), would it be possible to put a 
disclaimer on the Ensembl BioMart pages (maybe the Attribute selection 
page) that hints at the issue?  I suggest pasting the e-mail you 
forwarded to me into a FAQ entry and putting on the BioMart page something 
like:

Warning: A bug in the BioMart software will restrict your results to genes 
that have associated protein sequences if you select certain 
translation-related attributes, even though the count will show the number 
of all genes passing the filters.  See the FAQ [link] for more 
information.

It would be a bonus to mark the problematic (translation-related) 
attributes, e.g. with an asterisk, and mention that in the disclaimer, but 
even a simple sentence might save the unsuspecting user several hours of 
trial and error downloading data (and wasting bandwidth in the process).

Thanks,

Thomas

On Mon, 28 Oct 2013, Rhoda Kinsella wrote:

> Hi Thomas,
> 
> Thanks for your email. Unfortunately, this is a very well known
> issue caused by a limitation in the BioMart software that we use to build
> the mart databases. Below is a detailed description of what is going on
> behind the scenes which was sent to a user who had a similar query to
> yours: 
> 
>
>       In the Ensembl Mart, the main dataset is made up of three main
>       tables, each with a number of associated dimension (dm) tables.
>
>       The first main table is built from the gene table in the Ensembl
>       schema.  With it goes all the information directly associated with
>       the gene table, such as cross-references assigned directly to genes
>       (as opposed to transcripts or translations).
>
>       The second main table inherits its fields from the gene main table
>       and adds the transcript-related data, cross-references specifically
>       on transcripts etc.  When building the transcript main table, it
>       also inherits the data from the gene main table.  More specifically,
>       the transcript main table contains the data from the gene table for
>       all genes that have transcripts, i.e. all genes.
>
>       Depending on what the user asks for, the gene or the transcript
>       main table will be used.  Asking for HGNC symbols and transcript
>       stable IDs, for example, will use the transcript main table since
>       the transcript stable IDs are not available in the gene main table
>       (the HGNC symbols will be joined in from the cross-reference
>       dimension table hanging off the gene main table using the gene key
>       which is available in the transcript main table).
>
>       The third main table inherits its structure from the second main
>       table, which means that it contains all the fields from the gene
>       main table and the transcript main table, and then it adds the
>       specific fields for the translations, cross-references specifically
>       on translations etc.  It contains all the data from the transcript
>       main table, but, and this is important, only for the transcripts
>       that have translations, i.e. *not* all transcripts.
>
>       When you ask for the Swissprot ID (or any other external reference
>       mapped to translations, e.g. GO ID, GOSlim ID, EMBL ID or HPA ID in
>       human), which is a cross-reference associated with translations,
>       the main table that will be involved must be the translation main
>       table.  Since the translation main table only contains data for the
>       transcripts (and genes) that have translations, filtering in such a
>       way that only non-coding genes are considered will ensure that no
>       data is returned.  This is what happens when you, for example,
>       filter for genes overlapping a particular probe, and the only genes
>       that do so are non-coding genes.
>
>       This is the way that the MartBuilder tool works when inheriting
>       table structures for multiple main tables in a dataset, and it has
>       nothing to do with joining (that we can influence).  It is a
>       deficiency in the BioMart software that we are aware of and you are
>       not the first user to stumble upon it.  We will obviously make sure
>       that the BioMart team in Toronto are made aware of this (again).
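
The inheritance described above can be mimicked with a toy in-memory model. This is only a sketch of the behaviour, not of how MartBuilder is implemented: the gene, transcript and translation IDs, the field names, and the pick_main_table and query helpers are all hypothetical.

```python
# Toy model of the three inherited main tables (all IDs are hypothetical).
GENES = ["G_coding", "G_noncoding"]
TRANSCRIPTS = {"T1": "G_coding", "T2": "G_noncoding"}  # transcript -> gene
TRANSLATIONS = {"P1": "T1"}                            # translation -> transcript

# Gene main table: one row per gene.
gene_main = [{"gene": g} for g in GENES]

# Transcript main table: gene fields plus transcript fields,
# one row per transcript (every gene has at least one transcript).
transcript_main = [{"gene": g, "transcript": t} for t, g in TRANSCRIPTS.items()]

# Translation main table: inherits the transcript fields, but only has
# rows for transcripts that actually have a translation.
translation_main = [
    {"gene": TRANSCRIPTS[t], "transcript": t, "translation": p}
    for p, t in TRANSLATIONS.items()
]

LEVEL = {"gene": 0, "transcript": 1, "translation": 2}
MAIN_TABLES = [gene_main, transcript_main, translation_main]


def pick_main_table(attributes):
    """Use the deepest main table that any requested attribute needs."""
    return MAIN_TABLES[max(LEVEL[a] for a in attributes)]


def query(attributes, gene_filter):
    """Return one tuple per row of the chosen main table that passes the filter."""
    rows = pick_main_table(attributes)
    return [tuple(r[a] for a in attributes) for r in rows if r["gene"] in gene_filter]
```

In this model, query(["gene", "transcript"], {"G_noncoding"}) returns one row, while adding "translation" to the attribute list switches to the translation main table and the same filter returns nothing, because the non-coding gene was never copied into that table.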
> 
> 
> 
> I hope it explains the issue but do email again if you have further
> questions. We will contact the BioMart team again and see if they have a
> plan to fix this issue.
> Kind regards
> Rhoda
> 
> 
> 
> On 25 Oct 2013, at 22:09, "Thomas Danhorn" <danhornt at njhealth.org> wrote:
>
>       There is a bug in Ensembl BioMart that omits certain genes from
>       the results when certain attributes are chosen.  This affects
>       all versions of Ensembl I looked at, both the current one (73)
>       and the archive (tested 65 and 67), for at least mouse and human
>       (did not check other organisms).
>
>       The effect is severe: tens of thousands of genes will not show
>       up in the output of the results, but there is no indication of
>       this until the output is examined - the "Count"-button will
>       always show the expected number of genes.  Note that I am not
>       complaining about empty fields for certain attributes, but about
>       the disappearance of whole lines (i.e. genes/transcripts) when
>       (and only when) those attributes are chosen.
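
One way to see which genes vanish is to download the results twice, once with only the default attributes and once with a problematic attribute added, and compare the gene IDs. The following is a minimal sketch under some assumptions: both exports are tab-separated with a header line, and the gene ID column is headed "Ensembl Gene ID" (pass a different column name if your export differs). The gene_ids and dropped_genes helpers are hypothetical names, not part of any BioMart tool.

```python
import csv
import io


def gene_ids(tsv_text, column="Ensembl Gene ID"):
    """Return the set of gene IDs found in one TSV export (header line included)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader}


def dropped_genes(baseline_tsv, with_attribute_tsv, column="Ensembl Gene ID"):
    """Gene IDs present in the baseline export but absent from the second export."""
    missing = gene_ids(baseline_tsv, column) - gene_ids(with_attribute_tsv, column)
    return sorted(missing)
```

A header-only download simply yields an empty set, so every gene from the baseline export is reported as dropped.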
>
>       The affected attributes include all of the GO and GOSlim GOA
>       fields, several of the "External References" (e.g. EMBL
>       (Genbank) ID and Human Protein Atlas Antibody ID, but not the
>       various Ensembl and VEGA fields, PUBMED ID, or UCSC ID), and
>       most of the "PROTEIN DOMAINS" information (but not the Ensembl
>       fields).  None of the "GENE", "EXPRESSION", and "Microarray
>       Probes" fields I looked at seems to cause this problem.
>
>       Steps to reproduce:
>       - Go to the Ensembl BioMart website (current or archive).  Choose
>       "Ensembl Genes [version]" and an organism (mouse, human).
>       - Click on "Filters".  Check "ID List Limit" and enter
>       ENSG00000002079 (for human) or ENSMUSG00000000031 (for mouse) in
>       the box (with "Ensembl Gene ID(s)" selected).  Note that this
>       filtration step has nothing to do with the problem per se and
>       can be left out, but it reduces the results to 0 vs 1 as opposed
>       to e.g. 24k vs 55k.  The gene IDs I list here are among the
>       thousands that show this behavior; if you add others, those
>       might or might not.
>       - Click the "Count"-button - it will show 1 gene.
>       - Click "Results" - you will see a line for each transcript
>       (default attributes are Ensembl gene and transcript IDs), as
>       expected.
>       - Now click "Attributes" and check "GO Term Accession" (or any
>       of the other problematic ones listed above).
>       - Click "Count" again - this will still show 1 gene.
>       - Click "Results", and you will see a header line with your
>       selected attributes (like before), but no lines for the
>       gene/transcripts.  If you download the results, the file will
>       only contain the header.  I would expect to see here the same
>       output (one line per transcript) as above, with an extra
>       (perhaps empty) column for the "GO Term Accession" (or whatever
>       other problematic attribute you selected).
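
The same two requests can also be written as BioMart XML queries, which may be easier to attach to a bug report than a list of clicks. This is only a sketch: the dataset name hsapiens_gene_ensembl and the internal names ensembl_gene_id, ensembl_transcript_id and go_id are assumptions about how the web interface labels map to internal names in a given release, and build_query is a hypothetical helper. The code only builds the XML text and does not contact any server.

```python
import xml.etree.ElementTree as ET


def build_query(gene_id, attributes, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query string for one gene ID and a list of attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filter name is an assumed internal name for "Ensembl Gene ID(s)".
    ET.SubElement(ds, "Filter", {"name": "ensembl_gene_id", "value": gene_id})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Default attributes, then the same query with one translation-level attribute added.
BASE = ["ensembl_gene_id", "ensembl_transcript_id"]
baseline_xml = build_query("ENSG00000002079", BASE)
problem_xml = build_query("ENSG00000002079", BASE + ["go_id"])
```

If the names above match the release being queried, submitting baseline_xml and problem_xml to the mart's query service should reproduce the "one line per transcript" versus "header only" difference described in the steps.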
>
>       I have observed this behavior over several days/weeks, so I don't
>       think this is a momentary server glitch.  I doubt that it makes
>       a difference, but I am using Firefox 20.0 on Linux.  Please let
>       me know if you need additional information (more problematic
>       gene IDs, etc.).  I could not check if this has been reported
>       before (due to the server problems the mail archive is still
>       down), but I hope this information is helpful.
>
>       Thanks,
>
>       Thomas
> 
>
>       NOTICE: This email message is for the sole use of the intended
>       recipient(s) and may contain confidential and privileged
>       information. Any unauthorized review, use, disclosure or
>       distribution is prohibited. If you are not the intended
>       recipient, please contact the sender by reply email and destroy
>       all copies of the original message.
>
>       _______________________________________________
>       Dev mailing list    Dev at ensembl.org
>       Posting guidelines and subscribe/unsubscribe info:
>       http://lists.ensembl.org/mailman/listinfo/dev
>       Ensembl Blog: http://www.ensembl.info/
> 
> 
> Rhoda Kinsella Ph.D.
> Ensembl Production Project Leader
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton,
> Cambridge
> CB10 1SD
> 