[ensembl-dev] Biomart downloading inconsistencies

Denise Carvalho-Silva denise at ebi.ac.uk
Tue Feb 3 20:59:03 GMT 2015


Hi Venetia,

useast is a mirror of the main site and should contain the same data as
www.ensembl.org in the UK.

Indeed, when I search useast.ensembl.org for genes from your list below,
e.g. ENSG00000001084, ENSG00000090989 and ENSG00000213347, I do get
results for them, so they are not missing on useast.

http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000001084;r=6:53497341-53616970

http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000090989;r=4:55853616-55905034

http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000213347;r=5:177301461-177312757

In addition, when I run a query in BioMart (on both the useast and main
sites) using the same filter you did, I get the same number of entries on
each, as expected, i.e. 60249 / 64814 Genes (see the attached screenshots).

BioMart is not the ideal tool for retrieving genome-wide data. What I
think is happening here is that the results get truncated on useast: the
servers there are overloaded because it is working hours in the US (but
almost bedtime in the UK), and the website times out.

You could perhaps try running this query in batches (for example, one
chromosome at a time).
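
As a rough illustration of the batching idea, here is a minimal Python
sketch that builds one BioMart XML query per chromosome and checks whether
each response arrived complete. Several things in it are assumptions rather
than details from this thread: the internal dataset, filter and attribute
names (please confirm them with the XML button in martview), the chromosome
list, and the helper names build_query, split_stamp, fetch_chromosome and
download_all, which are made up for this example. It relies on the
completionStamp option of the martservice interface, which should append a
final "[success]" line when the whole result was sent. The Genbank filter
and the unspliced sequence attribute from your query are left out, since I
have not checked their internal names.

```python
import urllib.parse
import urllib.request
from xml.sax.saxutils import quoteattr

MARTSERVICE = "http://www.ensembl.org/biomart/martservice"

# Assumed internal BioMart names; verify them for your release.
DATASET = "hsapiens_gene_ensembl"
ATTRIBUTES = ["ensembl_gene_id", "ensembl_transcript_id",
              "chromosome_name", "strand"]
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def build_query(chromosome):
    """Return the BioMart XML query restricted to a single chromosome."""
    attrs = "".join("<Attribute name=%s />" % quoteattr(a) for a in ATTRIBUTES)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="0" '
        'uniqueRows="1" completionStamp="1">'
        '<Dataset name=%s interface="default">'
        '<Filter name="chromosome_name" value=%s />'
        "%s"
        "</Dataset></Query>"
    ) % (quoteattr(DATASET), quoteattr(chromosome), attrs)


def split_stamp(body):
    """Split a response into (rows, complete).

    complete is True only when the last line is the [success] stamp,
    i.e. the result was not cut off by a timeout.
    """
    lines = body.rstrip("\n").split("\n")
    complete = lines[-1].strip() == "[success]"
    rows = lines[:-1] if complete else [line for line in lines if line]
    return rows, complete


def fetch_chromosome(chromosome, timeout=300):
    """POST the query for one chromosome and return (rows, complete)."""
    data = urllib.parse.urlencode({"query": build_query(chromosome)})
    with urllib.request.urlopen(MARTSERVICE, data=data.encode("ascii"),
                                timeout=timeout) as response:
        return split_stamp(response.read().decode("utf-8"))


def download_all(chromosomes=CHROMOSOMES):
    """Fetch every chromosome; return (all_rows, chromosomes_to_retry)."""
    all_rows, incomplete = [], []
    for chromosome in chromosomes:
        rows, complete = fetch_chromosome(chromosome)
        if complete:
            all_rows.extend(rows)
        else:
            incomplete.append(chromosome)
    return all_rows, incomplete
```

Calling download_all() would return the combined rows plus a list of any
chromosomes whose response was missing the stamp, so only those batches
need to be requested again.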

The best alternative for genome-wide queries is to use our Perl APIs:

http://www.ensembl.org/info/docs/api/core/index.html#api
http://www.ensembl.org/info/docs/api/core/core_tutorial.html
http://www.ensembl.org/info/docs/Doxygen/core-api/index.html

Hope this makes sense.

Kind regards,

Denise
Ensembl Outreach


> Hello,
>
> I'm writing because I found a discrepancy between the
> http://useast.ensembl.org/biomart/martview/ and
> http://www.ensembl.org/biomart/martview
> web services and would like to confirm that the data I have are complete.
>
> I submitted the following search on both the UK and the US East servers:
>
> Dataset
> Homo sapiens genes (GRCh38)
> Filters
> With protein(Genbank) ID(s): Excluded
> Attributes
> Ensembl Gene ID
> Ensembl Transcript ID
> Chromosome Name
> Strand
> Unspliced (Transcript)
> Associated Gene Name
>
> But I got different results. The difference is quite big, as they include
> 67,482 and 78,820 sequences respectively. As far as I could check, the file
> from ensembl.org contained all the entries that useast.ensembl.org had,
> plus some more.
>
> Here are some genes that weren't included in the useast.ensembl.org file
> but were present in the ensembl.org one:
> ENSG00000001084
> ENSG00000090989
> ENSG00000104723
> ENSG00000108846
> ENSG00000109158
> ENSG00000109171
> ENSG00000109180
> ENSG00000109182
> ENSG00000109184
> ENSG00000109452
> ENSG00000118564
> ENSG00000118579
> ENSG00000120708
> ENSG00000123415
> ENSG00000129187
> ENSG00000133835
> ENSG00000138678
> ENSG00000145216
> ENSG00000145868
> ENSG00000151466
> ENSG00000163138
> ENSG00000170365
> ENSG00000180104
> ENSG00000196353
> ENSG00000213347
> ENSG00000234492
> ENSG00000245526
> ENSG00000250328
> ENSG00000278610
>
> After noticing this I re-downloaded from useast.ensembl.org several times.
> Each time the file had a different size and none of the files had the same
> size as the ensembl.org one.
>
> I would like to know whether the data I downloaded from ensembl.org
> include all results or if you suggest getting them again in a different
> way.
>
> Thank you in advance,
> Venetia
>


-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2015-02-03 at 20.35.39.png
Type: image/png
Size: 68681 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20150203/f7a1ccd6/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2015-02-03 at 20.36.36.png
Type: image/png
Size: 73168 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20150203/f7a1ccd6/attachment-0001.png>

