[ensembl-dev] Biomart downloading inconsistencies

Venetia Pliatsika Venetia.Pliatsika at jefferson.edu
Wed Feb 4 14:51:23 GMT 2015


Denise,

Thank you for your reply.

My team always downloads using the Perl APIs, and I did use them here.
However, I still have the same problem even with those: the files I get
seem incomplete. The most I got through the Perl APIs was 56330 genes,
and through BioMart the most I got was 34413. We also download one
chromosome at a time when using the Perl APIs.


I tried accessing the individual genes through BioMart and, indeed, I can
see them on both servers. Also, when I perform a result count using the
same filters, I get the same numbers on both (60249 / 64814) but, as I
mentioned, I'm unable to download all of them. Is there a count for the
distinct transcripts as well?

So, do you have another suggestion for how to download them? And is
there a way to confirm that the results I get are complete, other than the
gene count?
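
One way to check a download beyond the gene count is to count the distinct
gene and transcript IDs in the file itself and to diff the ID sets of two
downloads. Below is a minimal Python sketch of that idea. It assumes the
export is FASTA with "|"-separated header fields in the order the
attributes were selected (gene ID first, transcript ID second); that
layout, the function names and the placeholder IDs are assumptions for
illustration, not something confirmed in this thread.

```python
def ids_from_fasta(fasta_text, gene_field=0, transcript_field=1):
    """Collect distinct gene and transcript IDs from FASTA header lines.

    Assumes headers look like '>GENE_ID|TRANSCRIPT_ID|...' (an assumption
    about the export layout; adjust the field positions if yours differs).
    """
    genes, transcripts = set(), set()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].strip().split("|")
        if len(fields) > gene_field and fields[gene_field]:
            genes.add(fields[gene_field])
        if len(fields) > transcript_field and fields[transcript_field]:
            transcripts.add(fields[transcript_field])
    return genes, transcripts


def completeness_report(fasta_text, expected_genes):
    """Compare the distinct gene count in a download with an expected count."""
    genes, transcripts = ids_from_fasta(fasta_text)
    return {
        "genes": len(genes),
        "transcripts": len(transcripts),
        "expected_genes": expected_genes,
        "complete": len(genes) == expected_genes,
    }


def missing_from(reference_ids, candidate_ids):
    """Return the IDs present in the reference set but absent from the candidate."""
    return sorted(set(reference_ids) - set(candidate_ids))


if __name__ == "__main__":
    # Placeholder records, not real Ensembl data.
    full = ">GENE_A|TX_A1|6|1|NAME_A\nACGT\n>GENE_B|TX_B1|4|-1|NAME_B\nTTGA\n"
    partial = ">GENE_A|TX_A1|6|1|NAME_A\nACGT\n"
    print(completeness_report(partial, expected_genes=2))
    print(missing_from(ids_from_fasta(full)[0], ids_from_fasta(partial)[0]))
```

Here expected_genes would be the first number of the BioMart result count
(60249 for the filters above), and missing_from lists genes that one
download has and the other lacks.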

Thank you again,
Venetia



On 2/3/15 3:59 PM, "Denise Carvalho-Silva" <denise at ebi.ac.uk> wrote:

>Hi Venetia,
>
>The useast is a mirror of the main site and should contain the same data
>as our www.ensembl.org in the UK.
>
>This is so much so that when I search for the genes from your list below,
>e.g. ENSG00000001084, ENSG00000090989 on useast.ensembl.org I do get
>results for them, so they are not missing on useast.
>
>http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000001084;r=6:53497341-53616970
>
>http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000090989;r=4:55853616-55905034
>
>http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000213347;r=5:177301461-177312757
>
>In addition to this, when I perform a query in BioMart (on both useast and
>main sites) using the filter you did, I do get the same number of entries
>as expected, i.e. 60249 / 64814 Genes (see the attached screenshots).
>
>BioMart is not the ideal tool for retrieving genome-wide data. What I
>think is happening here is that the results get truncated on useast:
>the servers there are overloaded because it's working hours in the US
>(but almost bedtime in the UK), and the website times out.
>
>You could perhaps try to perform this query in batches (for example,
>one chromosome at a time).
>
>The best alternative for genome wide queries is to use our Perl APIs:
>
>http://www.ensembl.org/info/docs/api/core/index.html#api
>http://www.ensembl.org/info/docs/api/core/core_tutorial.html
>http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
>
>Hope it makes sense.
>
>Kind regards,
>
>Denise
>Ensembl Outreach
>
>
>> Hello,
>>
>> I'm writing because I found a discrepancy between the
>> http://useast.ensembl.org/biomart/martview/ and
>> http://www.ensembl.org/biomart/martview
>> web services and would like to confirm that the data I have are
>> complete.
>>
>> I submitted the following search on both the UK and the US East servers:
>>
>> Dataset
>> Homo sapiens genes (GRCh38)
>> Filters
>> With protein(Genbank) ID(s): Excluded
>> Attributes
>> Ensembl Gene ID
>> Ensembl Transcript ID
>> Chromosome Name
>> Strand
>> Unspliced (Transcript)
>> Associated Gene Name
>>
>> But I got different results. The difference is quite big: the two
>> downloads contain 67,482 and 78,820 sequences. As far as I checked,
>> the file from ensembl.org contained all entries that the
>> useast.ensembl.org file had, plus some more.
>>
>> Here are some genes that weren't included in the useast.ensembl.org
>> file but were present in the ensembl.org one:
>> ENSG00000001084
>> ENSG00000090989
>> ENSG00000104723
>> ENSG00000108846
>> ENSG00000109158
>> ENSG00000109171
>> ENSG00000109180
>> ENSG00000109182
>> ENSG00000109184
>> ENSG00000109452
>> ENSG00000118564
>> ENSG00000118579
>> ENSG00000120708
>> ENSG00000123415
>> ENSG00000129187
>> ENSG00000133835
>> ENSG00000138678
>> ENSG00000145216
>> ENSG00000145868
>> ENSG00000151466
>> ENSG00000163138
>> ENSG00000170365
>> ENSG00000180104
>> ENSG00000196353
>> ENSG00000213347
>> ENSG00000234492
>> ENSG00000245526
>> ENSG00000250328
>> ENSG00000278610
>>
>> After noticing this I re-downloaded from useast.ensembl.org several
>> times. Each time the file had a different size, and none of the files
>> had the same size as the ensembl.org one.
>>
>> I would like to know whether the data I downloaded from ensembl.org
>> include all results or if you suggest getting them again in a different
>> way.
>>
>> Thank you in advance,
>> Venetia
>>
>> The information contained in this transmission contains privileged and
>> confidential information. It is intended only for the use of the person
>> named above. If you are not the intended recipient, you are hereby
>> notified that any review, dissemination, distribution or duplication of
>> this communication is strictly prohibited. If you are not the intended
>> recipient, please contact the sender by reply email and destroy all
>> copies of the original message.
>>
>> CAUTION: Intended recipients should NOT use email communication for
>> emergent or urgent health care matters.
>>
>>
>
>
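
The per-chromosome batching suggested in the quoted reply could be
scripted against the BioMart XML query interface (martservice). The
Python sketch below only builds the query XML for each chromosome and
checks a response for a completion marker; it sends nothing over the
network. The dataset, filter and attribute names used here, and the
completionStamp option that appends "[success]" to a finished result,
are taken from my understanding of BioMart and should be verified
against the current BioMart help pages. The "With protein(Genbank) ID(s):
Excluded" filter from the original search is deliberately left out,
since its internal name is not given in this thread.

```python
import xml.etree.ElementTree as ET

# Primary human chromosomes only; genes on patches or scaffolds would
# need their own batches (assumption: they are listed under other names).
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def build_query(chromosome,
                attributes=("ensembl_gene_id", "ensembl_transcript_id"),
                dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query restricted to a single chromosome."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
        # Ask the server to append "[success]" when the result is complete.
        "completionStamp": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    ET.SubElement(ds, "Filter", {"name": "chromosome_name", "value": chromosome})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def split_completed(response_text):
    """Return (rows, complete): data rows and whether the stamp was present."""
    lines = response_text.rstrip("\n").split("\n")
    complete = bool(lines) and lines[-1].strip() == "[success]"
    rows = lines[:-1] if complete else lines
    return [row for row in rows if row], complete


if __name__ == "__main__":
    # One query per chromosome; each would be submitted separately and
    # retried if its response lacks the completion marker.
    for chrom in CHROMOSOMES[:2]:
        print(build_query(chrom))
```

A batch whose response does not end with the marker would be treated as
truncated and requested again, rather than relying on file size alone.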
