[ensembl-dev] Biomart downloading inconsistencies

Andrew Yates ayates at ebi.ac.uk
Wed Feb 4 17:13:49 GMT 2015


Hi,

BioMart can be problematic when requesting complex data sets such as large-scale sequence retrieval. If you want to keep using BioMart, you can ask it to email the results to you. If you want to interact with it more programmatically, you could run a first query for the genes without a GenBank protein ID, chunk the resulting list of gene IDs into blocks of 500, and then request the sequences one block at a time.
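
As a rough illustration of the chunking step, here is a minimal Python sketch. It only covers splitting an ID list into blocks of 500 and checking afterwards which requested IDs never came back; how each block is actually submitted to BioMart is left out. The names chunked, missing_ids and gene_ids are hypothetical helpers made up for this example, and the IDs are synthetic placeholders rather than real query results.

```python
from typing import Iterable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive blocks of at most `size` IDs, preserving order."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def missing_ids(requested: Iterable[str], returned: Iterable[str]) -> List[str]:
    """Return the requested IDs that do not appear among the returned ones."""
    seen = set(returned)
    return [gene_id for gene_id in requested if gene_id not in seen]


# Synthetic stand-in for the gene IDs returned by the first query
# (assumption: in practice this list would be read from that query's output).
gene_ids = [f"ENSG{n:011d}" for n in range(1, 1202)]

blocks = list(chunked(gene_ids))
block_sizes = [len(block) for block in blocks]
```

With 1201 placeholder IDs this gives three blocks (500, 500 and 201 IDs). After each block has been downloaded, calling missing_ids(block, ids_in_result) on it gives a simple per-block completeness check, so a truncated block can be re-requested on its own instead of repeating the whole download.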

Hope this helps,

Andy

------------
Andrew Yates - Ensembl Support Coordinator
European Molecular Biology Laboratory
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD, United Kingdom
Tel: +44-(0)1223-492538
Fax: +44-(0)1223-494468
Skype: andrewyatz
http://www.ensembl.org/

> On 4 Feb 2015, at 16:53, Venetia Pliatsika <Venetia.Pliatsika at jefferson.edu> wrote:
> 
> Denise,
> 
> I need to make a correction on my previous email.
> 
> When I download through the Perl APIs I use a subset of chromosomes, so
> when I download from ensembl.org the gene counts I get are in agreement
> with the gene counts from the BIOMART website.
> 
> To sum up I got:
> Perl APIs + ensembl.org               | all results
> Perl APIs + useast.ensembl.org        | partial results
> BIOMART web app + ensembl.org         | partial results
> BIOMART web app + useast.ensembl.org  | partial results
> 
> I'm quite confident that the results I got from Perl APIs + ensembl.org
> are whole because I submitted several other queries and they all seem to
> agree with the BIOMART web app counts. However, I would still like to know
> whether there is a way to confirm that the transcript count is also
> correct.
> 
> Thank you very much again,
> Venetia
> 
> 
> 
> 
> 
> On 2/4/15 9:51 AM, "Venetia Pliatsika" <Venetia.Pliatsika at jefferson.edu>
> wrote:
> 
>> Denise,
>> 
>> Thank you for your reply.
>> 
>> My team always downloads using the Perl APIs, and I did use them.
>> However, I still have the same problem even with those. The files I get
>> seem incomplete: the most I got was 56330 genes through the Perl APIs
>> and 34413 through BIOMART. We also download one chromosome at a time
>> when using the Perl APIs.
>> 
>> 
>> I tried accessing the individual genes through BIOMART and, indeed, I can
>> see them on both servers. Also, when I perform a result count using the
>> same filters I do get the same numbers (60249 / 64814) but, as I
>> mentioned, I'm unable to download all of them. Is there a count for the
>> distinct transcripts as well?
>> 
>> So, do you have another suggestion as to how to download them and, is
>> there a way to confirm that the results I get are complete other than the
>> gene count?
>> 
>> Thank you again,
>> Venetia
>> 
>> 
>> 
>> On 2/3/15 3:59 PM, "Denise Carvalho-Silva" <denise at ebi.ac.uk> wrote:
>> 
>>> Hi Venetia,
>>> 
>>> The useast is a mirror of the main site and should contain the same data
>>> as our www.ensembl.org in the UK.
>>> 
>>> This is so much so that when I search for the genes from your list below,
>>> e.g. ENSG00000001084, ENSG00000090989 on useast.ensembl.org I do get
>>> results for them, so they are not missing on useast.
>>> 
>>> http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000001084;r=6:53497341-53616970
>>> 
>>> http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000090989;r=4:55853616-55905034
>>> 
>>> http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000213347;r=5:177301461-177312757
>>> 
>>> In addition to this, when I perform a query in BioMart (on both the useast
>>> and main sites) using the filter you did, I do get the same number of
>>> entries as expected, i.e. 60249 / 64814 Genes (see the attached screenshots).
>>> 
>>> BioMart is not the ideal tool for retrieval of genome-wide data. So what I
>>> think is happening here is that the results get truncated on useast: the
>>> servers there get overloaded as it's working hours in the US (but almost
>>> bedtime in the UK), and there is a timeout on the website.
>>> 
>>> You could perhaps try to perform this query in batches (for example, one
>>> chromosome at a time).
>>> 
>>> The best alternative for genome-wide queries is to use our Perl APIs:
>>> 
>>> http://www.ensembl.org/info/docs/api/core/index.html#api
>>> http://www.ensembl.org/info/docs/api/core/core_tutorial.html
>>> http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
>>> 
>>> Hope it makes sense.
>>> 
>>> Kind regards,
>>> 
>>> Denise
>>> Ensembl Outreach
>>> 
>>> 
>>>> Hello,
>>>> 
>>>> I'm writing because I found a discrepancy between the
>>>> http://useast.ensembl.org/biomart/martview/ and
>>>> http://www.ensembl.org/biomart/martview
>>>> web services and would like to confirm that the data I have are
>>>> complete.
>>>> 
>>>> I submitted the following search on both the UK and the US East
>>>> servers:
>>>> 
>>>> Dataset
>>>> Homo sapiens genes (GRCh38)
>>>> Filters
>>>> With protein(Genbank) ID(s): Excluded
>>>> Attributes
>>>> Ensembl Gene ID
>>>> Ensembl Transcript ID
>>>> Chromosome Name
>>>> Strand
>>>> Unspliced (Transcript)
>>>> Associated Gene Name
>>>> 
>>>> But I got different results. The difference is quite big: the two
>>>> downloads contain 67,482 and 78,820 sequences. As far as I checked, the
>>>> file from ensembl.org contained all the entries that the
>>>> useast.ensembl.org file had, plus some more.
>>>> 
>>>> Here are some genes that weren't included in the useast.ensembl.org file
>>>> but were present in the ensembl.org one:
>>>> ENSG00000001084
>>>> ENSG00000090989
>>>> ENSG00000104723
>>>> ENSG00000108846
>>>> ENSG00000109158
>>>> ENSG00000109171
>>>> ENSG00000109180
>>>> ENSG00000109182
>>>> ENSG00000109184
>>>> ENSG00000109452
>>>> ENSG00000118564
>>>> ENSG00000118579
>>>> ENSG00000120708
>>>> ENSG00000123415
>>>> ENSG00000129187
>>>> ENSG00000133835
>>>> ENSG00000138678
>>>> ENSG00000145216
>>>> ENSG00000145868
>>>> ENSG00000151466
>>>> ENSG00000163138
>>>> ENSG00000170365
>>>> ENSG00000180104
>>>> ENSG00000196353
>>>> ENSG00000213347
>>>> ENSG00000234492
>>>> ENSG00000245526
>>>> ENSG00000250328
>>>> ENSG00000278610
>>>> 
>>>> After noticing this I re-downloaded from useast.ensembl.org several times.
>>>> Each time the file had a different size, and none of the files had the
>>>> same size as the ensembl.org one.
>>>> 
>>>> I would like to know whether the data I downloaded from ensembl.org
>>>> include all results or if you suggest getting them again in a different
>>>> way.
>>>> 
>>>> Thank you in advance,
>>>> Venetia
>>>> 
>>>> The information contained in this transmission contains privileged and
>>>> confidential information. It is intended only for the use of the person
>>>> named above. If you are not the intended recipient, you are hereby
>>>> notified that any review, dissemination, distribution or duplication of
>>>> this communication is strictly prohibited. If you are not the intended
>>>> recipient, please contact the sender by reply email and destroy all
>>>> copies
>>>> of the original message.
>>>> 
>>>> CAUTION: Intended recipients should NOT use email communication for
>>>> emergent or urgent health care matters.
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>> 
>>> 
>>> 
>> 
> 




