[ensembl-dev] Bug in Ensembl BioMart - missing genes if some attributes are selected

Thomas Danhorn danhornt at njhealth.org
Fri Oct 25 22:09:35 BST 2013


There is a bug in Ensembl BioMart that omits certain genes from the 
results when certain attributes are chosen.  This affects all versions of 
Ensembl I looked at, both the current one (73) and the archive (tested 65 
and 67), for at least mouse and human (I did not check other organisms).

The effect is severe: tens of thousands of genes will not show up in the 
results, but there is no indication of this until the output is 
examined, because the "Count" button will always show the expected number 
of genes.  Note that I am not complaining about empty fields for certain 
attributes, but about the disappearance of whole lines (i.e. 
genes/transcripts) when (and only when) those attributes are chosen.
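
One way to confirm that whole lines are missing, rather than just fields 
being empty, is to compare two downloaded result files: one with only the 
default attributes and one with a problematic attribute added.  The 
following is a minimal sketch, assuming both files are tab-separated with 
a header line and the Ensembl gene ID in the first column; the function 
names are hypothetical and not part of BioMart.

```python
import csv
import io


def gene_ids(tsv_text, id_column=0):
    """Return the set of gene IDs in a tab-separated export (header skipped)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header line
    return {row[id_column] for row in reader if row and row[id_column]}


def missing_genes(baseline_tsv, extended_tsv):
    """List gene IDs present in the baseline export but absent from the
    export that includes the additional attribute."""
    return sorted(gene_ids(baseline_tsv) - gene_ids(extended_tsv))
```

A gene that merely has an empty value for the added attribute still 
appears in the second file and is not reported; only genes whose lines 
are gone entirely end up in the returned list.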

The affected attributes include all of the GO and GOSlim GOA fields, 
several of the "External References" (e.g. EMBL (Genbank) ID and Human 
Protein Atlas Antibody ID, but not the various Ensembl and VEGA fields, 
PUBMED ID, or UCSC ID), and most of the "PROTEIN DOMAINS" information (but 
not the Ensembl fields).  None of the "GENE", "EXPRESSION", and 
"Microarray Probes" fields I looked at seem to cause this problem.

Steps to reproduce:
- Go to the Ensembl BioMart website (current or archive).  Choose 
"Ensembl Genes [version]" and an organism (mouse, human).
- Click on "Filters".  Check "ID List Limit" and enter ENSG00000002079 
(for human) or ENSMUSG00000000031 (for mouse) in the box (with "Ensembl 
Gene ID(s)" selected).  Note that this filtering step has nothing to do 
with the problem per se and can be left out, but it reduces the comparison 
to 0 vs. 1 genes, as opposed to e.g. 24k vs. 55k.  The gene IDs I list 
here are among the thousands that show this behavior; if you add others, 
those might or might not.
- Click the "Count" button - it will show 1 gene.
- Click "Results" - you will see a line for each transcript (default 
attributes are Ensembl gene and transcript IDs), as expected.
- Now click "Attributes" and check "GO Term Accession" (or any of the 
other problematic ones listed above).
- Click "Count" again - this will still show 1 gene.
- Click "Results", and you will see a header line with your selected 
attributes (like before), but no lines for the gene/transcripts.  If you 
download the results, the file will contain only the header.  I would 
expect to see the same output here as above (one line per transcript), 
with an extra (perhaps empty) column for the "GO Term Accession" (or 
whatever other problematic attribute you selected).
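
The steps above could also be scripted against the BioMart XML query 
service, which may make it easier to test many attributes in turn.  The 
following is only a sketch under assumptions that I have not verified for 
every release: the service URL, the dataset name "hsapiens_gene_ensembl", 
and the internal names "ensembl_gene_id", "ensembl_transcript_id", and 
"go_id" (for "GO Term Accession") may differ between versions and archive 
sites.  The function names are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed service endpoint; archive sites use different host names.
MART_URL = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset, attributes, gene_id_list=None):
    """Build a BioMart XML query string for the given dataset and attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",        # include the header line, as in the web results
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    if gene_id_list:
        # Equivalent of the "ID List Limit" filter on Ensembl gene IDs.
        ET.SubElement(ds, "Filter", {"name": "ensembl_gene_id",
                                     "value": ",".join(gene_id_list)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def fetch(query_xml, url=MART_URL):
    """Send the query to the (assumed) mart service and return the TSV text."""
    full_url = url + "?" + urllib.parse.urlencode({"query": query_xml})
    with urllib.request.urlopen(full_url) as response:
        return response.read().decode("utf-8")


# Baseline query (default attributes) and the same query plus "go_id".
default_attrs = ["ensembl_gene_id", "ensembl_transcript_id"]
baseline_q = build_query("hsapiens_gene_ensembl", default_attrs,
                         ["ENSG00000002079"])
with_go_q = build_query("hsapiens_gene_ensembl", default_attrs + ["go_id"],
                        ["ENSG00000002079"])
```

Passing each query to fetch() and comparing the number of lines returned 
should show whether the second query comes back with only the header, 
which is the behavior described above.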

I have observed this behavior over several days/weeks, so I don't think 
this is a momentary server glitch.  I doubt that it makes a difference, 
but I am using Firefox 20.0 on Linux.  Please let me know if you need 
additional information (more problematic gene IDs, etc.).  I could not 
check whether this has been reported before (the mail archive is still 
down due to the server problems), but I hope this information is helpful.

Thanks,

Thomas

