[ensembl-dev] Biomart downloading inconsistencies
Denise Carvalho-Silva
denise at ebi.ac.uk
Tue Feb 10 13:44:29 GMT 2015
Hi Venetia,
Regarding your question 'Is there a count for the distinct transcripts as well?': no, there is not.
The count is available for genes only, because the Gene database in BioMart is gene-centric.
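If a transcript count is needed, one option is to count the distinct IDs in the downloaded file itself. The following is a minimal Python sketch, assuming the results were exported as tab-separated text with a header row and with columns named "Ensembl Gene ID" and "Ensembl Transcript ID" (the attribute names used in the query in this thread). The function name and its defaults are illustrative and not part of BioMart.

```python
import csv
import io


def count_distinct_ids(tsv_text,
                       gene_col="Ensembl Gene ID",
                       transcript_col="Ensembl Transcript ID"):
    """Return (distinct genes, distinct transcripts) found in a TSV export.

    Assumes a header row containing the two column names above; adjust
    the defaults if your export uses different headers.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    genes, transcripts = set(), set()
    for row in reader:
        # Skip empty cells so blank values are not counted as an ID.
        if row.get(gene_col):
            genes.add(row[gene_col])
        if row.get(transcript_col):
            transcripts.add(row[transcript_col])
    return len(genes), len(transcripts)
```

Comparing the gene figure from such a count with the count shown in BioMart gives a quick check on whether a download was truncated.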
You may also want to run your BioMart query one chromosome at a time, as you already do with the Ensembl Perl API.
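As a rough illustration of the one-chromosome-at-a-time approach, the Python sketch below builds one BioMart XML query per chromosome for the martservice endpoint. It is a sketch under stated assumptions rather than a tested recipe: the dataset name (hsapiens_gene_ensembl), the chromosome_name filter, the attribute names and the completionStamp option should all be checked against the XML that BioMart generates for your own query (the "XML" button in martview). The Genbank protein ID filter from the original query is left out because its internal name is not given in this thread. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint mentioned in this thread; the mirror could be substituted.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"

# Human chromosome names as used by Ensembl (assumed: 1-22, X, Y, MT).
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]

# Assumed internal attribute names for gene and transcript IDs.
ATTRIBUTES = ["ensembl_gene_id", "ensembl_transcript_id"]


def build_query(chromosome):
    """Return the BioMart XML query text for a single chromosome."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
        # Assumption: asks the server to append "[success]" on completion.
        "completionStamp": "1",
    })
    dataset = ET.SubElement(query, "Dataset", {
        "name": "hsapiens_gene_ensembl",
        "interface": "default",
    })
    ET.SubElement(dataset, "Filter",
                  {"name": "chromosome_name", "value": chromosome})
    for name in ATTRIBUTES:
        ET.SubElement(dataset, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def query_url(chromosome):
    """Return a martservice URL carrying the query for one chromosome."""
    return MARTSERVICE + "?" + urlencode({"query": build_query(chromosome)})


def is_complete(response_text):
    """True if the response ends with the assumed completion stamp."""
    return response_text.rstrip().endswith("[success]")
```

Each URL from query_url(c) for c in CHROMOSOMES can then be fetched separately. A response that does not pass is_complete is a candidate for a retry, which gives a check on completeness beyond the gene count.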
Kind regards,
Denise
On 4 Feb 2015, at 14:51, Venetia Pliatsika wrote:
> Denise,
>
> Thank you for your reply.
>
> My team always downloads using the Perl APIs, and I did use them.
> However, I still have the same problem even with those. The files I get
> seem incomplete: I got a maximum of 56330 genes through the Perl APIs
> and a maximum of 34413 through BioMart. We also download one
> chromosome at a time when using the Perl APIs.
>
>
> I tried accessing the individual genes through BioMart and, indeed, I can
> see them on both servers. Also, when I perform a result count using the
> same filters I do get the same numbers (60249 / 64814) but, as I
> mentioned, I'm unable to download all of them. Is there a count for the
> distinct transcripts as well?
>
> So, do you have another suggestion for how to download them, and is
> there a way to confirm that the results I get are complete, other than the
> gene count?
>
> Thank you again,
> Venetia
>
>
>
> On 2/3/15 3:59 PM, "Denise Carvalho-Silva" <denise at ebi.ac.uk> wrote:
>
>> Hi Venetia,
>>
>> The useast site is a mirror of the main site and should contain the same data
>> as www.ensembl.org in the UK.
>>
>> Indeed, when I search on useast.ensembl.org for genes from your list below,
>> e.g. ENSG00000001084 and ENSG00000090989, I do get results for them, so
>> they are not missing on useast.
>>
>> http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000001084;r=6:53497341-53616970
>>
>> http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000090989;r=4:55853616-55905034
>>
>> http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000213347;r=5:177301461-177312757
>>
>> In addition to this, when I perform a query in BioMart (on both useast and
>> main sites) using the filter you did, I do get the same number of entries
>> as expected, i.e. 60249 / 64814 Genes (see the attached screenshots).
>>
>> BioMart is not the ideal tool for retrieving genome-wide data. What I
>> think is happening here is that the results get truncated on useast
>> because the servers there are overloaded during US working hours
>> (almost bedtime in the UK) and the website times out.
>>
>> You could perhaps try to perform this query in batches (for example,
>> one chromosome at a time).
>>
>> The best alternative for genome wide queries is to use our Perl APIs:
>>
>> http://www.ensembl.org/info/docs/api/core/index.html#api
>> http://www.ensembl.org/info/docs/api/core/core_tutorial.html
>> http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
>>
>> Hope it makes sense.
>>
>> Kind regards,
>>
>> Denise
>> Ensembl Outreach
>>
>>
>>> Hello,
>>>
>>> I'm writing because I found a discrepancy between the
>>> http://useast.ensembl.org/biomart/martview/ and
>>> http://www.ensembl.org/biomart/martview
>>> web services and would like to confirm that the data I have are
>>> complete.
>>>
>>> I submitted the following search on both the UK and the US East servers:
>>>
>>> Dataset
>>> Homo sapiens genes (GRCh38)
>>> Filters
>>> With protein(Genbank) ID(s): Excluded
>>> Attributes
>>> Ensembl Gene ID
>>> Ensembl Transcript ID
>>> Chromosome Name
>>> Strand
>>> Unspliced (Transcript)
>>> Associated Gene Name
>>>
>>> But I got different results. The difference is quite big: the files
>>> include 67,482 and 78,820 sequences. As far as I checked, the file from
>>> ensembl.org contained all the entries that useast.ensembl.org had, plus
>>> some more.
>>>
>>> Here are some genes that weren't included in the useast.ensembl.org file but
>>> were present in ensembl.org:
>>> ENSG00000001084
>>> ENSG00000090989
>>> ENSG00000104723
>>> ENSG00000108846
>>> ENSG00000109158
>>> ENSG00000109171
>>> ENSG00000109180
>>> ENSG00000109182
>>> ENSG00000109184
>>> ENSG00000109452
>>> ENSG00000118564
>>> ENSG00000118579
>>> ENSG00000120708
>>> ENSG00000123415
>>> ENSG00000129187
>>> ENSG00000133835
>>> ENSG00000138678
>>> ENSG00000145216
>>> ENSG00000145868
>>> ENSG00000151466
>>> ENSG00000163138
>>> ENSG00000170365
>>> ENSG00000180104
>>> ENSG00000196353
>>> ENSG00000213347
>>> ENSG00000234492
>>> ENSG00000245526
>>> ENSG00000250328
>>> ENSG00000278610
>>>
>>> After noticing this I re-downloaded from useast.ensembl.org several
>>> times. Each time the file had a different size, and none of the files
>>> had the same size as the ensembl.org one.
>>>
>>> I would like to know whether the data I downloaded from ensembl.org
>>> include all results or if you suggest getting them again in a different
>>> way.
>>>
>>> Thank you in advance,
>>> Venetia
>>>
>>> The information contained in this transmission contains privileged and
>>> confidential information. It is intended only for the use of the person
>>> named above. If you are not the intended recipient, you are hereby
>>> notified that any review, dissemination, distribution or duplication of
>>> this communication is strictly prohibited. If you are not the intended
>>> recipient, please contact the sender by reply email and destroy all
>>> copies of the original message.
>>>
>>> CAUTION: Intended recipients should NOT use email communication for
>>> emergent or urgent health care matters.