[ensembl-dev] Biomart downloading inconsistencies

Venetia Pliatsika Venetia.Pliatsika at jefferson.edu
Wed Feb 4 16:53:43 GMT 2015


Denise,

I need to make a correction on my previous email.

When I download through the Perl APIs I use only a subset of chromosomes,
so when I download from ensembl.org the gene counts I get agree with
the gene counts from the BIOMART website.

To sum up, I got:
Perl APIs + ensembl.org              | all results
Perl APIs + useast.ensembl.org       | partial results
BIOMART web app + ensembl.org        | partial results
BIOMART web app + useast.ensembl.org | partial results

I'm quite confident that the results I got from Perl APIs + ensembl.org
are whole because I submitted several other queries and they all seem to
agree with the BIOMART web app counts. However, I would still like to know
whether there is a way to confirm that the transcript count is also
correct.
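
One way to check the transcript count independently of the web app is to
count the distinct gene and transcript IDs in the downloaded file itself.
The following is a minimal sketch, not an official Ensembl tool. It assumes
that stable IDs in the file look like ENSG or ENST followed by 11 digits
(as the gene IDs quoted in this thread do) and that the IDs appear either
in FASTA header lines or in tab-separated rows. The function name
count_distinct_ids is made up for this illustration.

```python
import re

# Assumed ID shapes: "ENSG"/"ENST" followed by 11 digits.
GENE_ID = re.compile(r"ENSG\d{11}")
TRANSCRIPT_ID = re.compile(r"ENST\d{11}")


def count_distinct_ids(lines):
    """Return (distinct gene IDs, distinct transcript IDs) found in lines.

    `lines` is any iterable of strings, for example an open file handle.
    Sequence lines contain no stable IDs, so they are scanned but ignored.
    """
    genes, transcripts = set(), set()
    for line in lines:
        genes.update(GENE_ID.findall(line))
        transcripts.update(TRANSCRIPT_ID.findall(line))
    return len(genes), len(transcripts)
```

Passing an open file handle for each download to this function gives two
numbers per file that can be compared across servers or against the count
shown in the web app.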

Thank you very much again,
Venetia





On 2/4/15 9:51 AM, "Venetia Pliatsika" <Venetia.Pliatsika at jefferson.edu>
wrote:

>Denise,
>
>Thank you for your reply.
>
>My team always downloads using the Perl APIs, and I did use them.
>However, I still have the same problem even with those. The files I get
>seem incomplete: I got a maximum of 56330 genes through the Perl APIs,
>and through BIOMART the most I got was 34413. We also download one
>chromosome at a time when using the Perl APIs.
>
>
>I tried accessing the individual genes through BIOMART and, indeed, I can
>see them on both servers. Also, when I perform a result count using the
>same filters I do get the same numbers (60249 / 64814) but, as I
>mentioned, I'm unable to download all of them. Is there a count for the
>distinct transcripts as well?
>
>So, do you have another suggestion as to how to download them and, is
>there a way to confirm that the results I get are complete other than the
>gene count?
>
>Thank you again,
>Venetia
>
>
>
>On 2/3/15 3:59 PM, "Denise Carvalho-Silva" <denise at ebi.ac.uk> wrote:
>
>>Hi Venetia,
>>
>>The useast is a mirror of the main site and should contain the same data
>>as our www.ensembl.org in the UK.
>>
>>This is so much so that when I search for the genes from your list below,
>>e.g. ENSG00000001084, ENSG00000090989 on useast.ensembl.org I do get
>>results for them, so they are not missing on useast.
>>
>>http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000001084;r=6:53497341-53616970
>>
>>http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000090989;r=4:55853616-55905034
>>
>>http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000213347;r=5:177301461-177312757
>>
>>In addition to this, when I perform a query in BioMart (on both useast
>>and main sites) using the filter you did, I do get the same number of
>>entries as expected, i.e. 60249 / 64814 Genes (see the attached
>>screenshots).
>>
>>BioMart is not the ideal tool for retrieval of genome-wide data. So what
>>I think is happening here is that the results get truncated on useast
>>because the servers there get overloaded during working hours in the US
>>(almost bedtime in the UK) and the website times out.
>>
>>You could perhaps try to perform this query in batches (for example, one
>>chr at a time).
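
The per-chromosome batching suggested above can be sketched as follows.
This is an illustration under several assumptions, not the documented
Ensembl procedure: the dataset name hsapiens_gene_ensembl, the filter name
chromosome_name and the attribute names ensembl_gene_id and
ensembl_transcript_id are assumed to match the mart in use, the GenBank
filter from the original query is left out, and the completionStamp="1"
option is assumed to make the server append a final "[success]" line, so a
truncated response can be detected. Please verify each of these against
the server before relying on them. The helper names are made up.

```python
import urllib.parse
import urllib.request
from xml.sax.saxutils import quoteattr

# Primary human chromosomes; adjust to the subset actually needed.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def build_query(chromosome,
                attributes=("ensembl_gene_id", "ensembl_transcript_id")):
    """Build a BioMart XML query restricted to one chromosome."""
    attribute_xml = "".join(
        f"<Attribute name={quoteattr(name)}/>" for name in attributes
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
        '<Query virtualSchemaName="default" formatter="TSV" header="0" '
        'uniqueRows="0" datasetConfigVersion="0.6" completionStamp="1">'
        '<Dataset name="hsapiens_gene_ensembl" interface="default">'
        f'<Filter name="chromosome_name" value={quoteattr(chromosome)}/>'
        f"{attribute_xml}"
        "</Dataset></Query>"
    )


def split_complete_response(text):
    """Return the data rows if the last line is the stamp, else None."""
    lines = text.rstrip("\n").split("\n")
    if lines and lines[-1].strip() == "[success]":
        return lines[:-1]
    return None


def fetch_chromosome(chromosome,
                     base_url="http://www.ensembl.org/biomart/martservice"):
    """POST one per-chromosome query and fail loudly on truncation."""
    data = urllib.parse.urlencode({"query": build_query(chromosome)}).encode()
    with urllib.request.urlopen(base_url, data=data) as response:
        text = response.read().decode()
    rows = split_complete_response(text)
    if rows is None:
        raise RuntimeError(f"chromosome {chromosome}: response looks truncated")
    return rows
```

Looping over CHROMOSOMES and calling fetch_chromosome for each one keeps
every request small, and a batch that fails the stamp check can simply be
requested again.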
>>
>>The best alternative for genome-wide queries is to use our Perl APIs:
>>
>>http://www.ensembl.org/info/docs/api/core/index.html#api
>>http://www.ensembl.org/info/docs/api/core/core_tutorial.html
>>http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
>>
>>Hope it makes sense.
>>
>>Kind regards,
>>
>>Denise
>>Ensembl Outreach
>>
>>
>>> Hello,
>>>
>>> I'm writing because I found a discrepancy between the
>>> http://useast.ensembl.org/biomart/martview/ and
>>> <http://www.ensembl.org/biomart/martservice>
>>> http://www.ensembl.org/biomart/martview
>>> web services and would like to confirm that the data I have are
>>> complete.
>>>
>>> I submitted the following search on both the UK and the US East servers:
>>>
>>> Dataset
>>>   Homo sapiens genes (GRCh38)
>>> Filters
>>>   With protein(Genbank) ID(s): Excluded
>>> Attributes
>>>   Ensembl Gene ID
>>>   Ensembl Transcript ID
>>>   Chromosome Name
>>>   Strand
>>>   Unspliced (Transcript)
>>>   Associated Gene Name
>>>
>>> But I got different results. The difference is quite big: the files
>>> include 67,482 and 78,820 sequences respectively. As far as I checked,
>>> the file from ensembl.org contained all the entries that
>>> useast.ensembl.org had, plus some more.
>>>
>>> Here are some genes that weren't included in the useast.ensembl.org
>>> file but were present in the ensembl.org one:
>>> ENSG00000001084
>>> ENSG00000090989
>>> ENSG00000104723
>>> ENSG00000108846
>>> ENSG00000109158
>>> ENSG00000109171
>>> ENSG00000109180
>>> ENSG00000109182
>>> ENSG00000109184
>>> ENSG00000109452
>>> ENSG00000118564
>>> ENSG00000118579
>>> ENSG00000120708
>>> ENSG00000123415
>>> ENSG00000129187
>>> ENSG00000133835
>>> ENSG00000138678
>>> ENSG00000145216
>>> ENSG00000145868
>>> ENSG00000151466
>>> ENSG00000163138
>>> ENSG00000170365
>>> ENSG00000180104
>>> ENSG00000196353
>>> ENSG00000213347
>>> ENSG00000234492
>>> ENSG00000245526
>>> ENSG00000250328
>>> ENSG00000278610
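
A list like the one above can be produced by taking the set difference of
the gene IDs found in the two downloads. This is a small sketch under the
assumption that gene IDs look like ENSG followed by 11 digits. The function
names gene_ids and missing_from are made up for this example.

```python
import re

# Assumed gene ID shape: "ENSG" followed by 11 digits.
GENE_ID = re.compile(r"ENSG\d{11}")


def gene_ids(lines):
    """Collect the distinct gene IDs that occur anywhere in the lines."""
    ids = set()
    for line in lines:
        ids.update(GENE_ID.findall(line))
    return ids


def missing_from(reference_lines, candidate_lines):
    """Return the sorted gene IDs in the reference but not the candidate."""
    return sorted(gene_ids(reference_lines) - gene_ids(candidate_lines))
```

With the ensembl.org file as the reference and a useast.ensembl.org file as
the candidate, an empty result means no gene is missing from the second
download. It says nothing about missing transcripts within a gene.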
>>>
>>> After noticing this I re-downloaded from useast.ensembl.org several
>>> times. Each time the file had a different size and none of the files
>>> had the same size as the ensembl.org one.
>>>
>>> I would like to know whether the data I downloaded from ensembl.org
>>> include all results or if you suggest getting them again in a different
>>> way.
>>>
>>> Thank you in advance,
>>> Venetia
>>>
>>> The information contained in this transmission contains privileged and
>>> confidential information. It is intended only for the use of the person
>>> named above. If you are not the intended recipient, you are hereby
>>> notified that any review, dissemination, distribution or duplication of
>>> this communication is strictly prohibited. If you are not the intended
>>> recipient, please contact the sender by reply email and destroy all
>>> copies of the original message.
>>>
>>> CAUTION: Intended recipients should NOT use email communication for
>>> emergent or urgent health care matters.
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>
>>
>
