[ensembl-dev] getting the entrez gene id from an ensembl record

Thu Dec 2 10:59:35 GMT 2010

On 2 Dec 2010, at 10:33, Andreas Kahari wrote:

> On Thu, Dec 02, 2010 at 10:53:36AM +0100, Patrick Meidl wrote:
>> On Thu, Dec 02 2010, Andy Jenkinson <andy.jenkinson at ebi.ac.uk> wrote:
>> 
>>>> On Thu, Dec 02 2010, Ian Longden <ianl at ebi.ac.uk> wrote:
>>>> 
>>>>> Use $gene->get_all_DBLinks as this gets the external database
>>>>> references on the transcripts and translations of the gene too.
>>>>> 
>>>>> DBEntries only gets the ones attached to the gene directly.
>>> 
>>> Also, whether something is attached to a gene or transcript is not
>>> always intuitive: as I understand it, it depends on how the mapping
>>> was done, not the logical data model relationships. You have to know
>>> the internal data model before you can use it. For example, why is an
>>> EntrezGene not a gene-related record?
>> 
>> exactly. I therefore think that what is now called get_all_DBLinks()
>> should have an intuitive name which highlights that in most cases,
>> _this_ is the right method for getting xrefs.
> 
> On the contrary.  If the user knows what external database they are
> querying for (which they often do), and they know what level the xrefs
> are annotated on (which they also often do), then get_all_DBEntries()
> is definitely the most correct method to call.  It is a lot quicker than
> get_all_DBLinks().  The DBLinks method is a lazy catch-all.

I'm not sure about this. You may well know which external database you want, and want to run the fastest query, but it is sometimes quite difficult to know which level the xref is annotated on, especially as the same database might be annotated on different levels in different species, as Ian says. When I have used these methods in the past, I have -wanted- to use DBEntries for speed reasons, but have been forced to use DBLinks because I don't know in advance which object (gene/transcript/translation) to use.
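
To make the trade-off concrete, here is a toy sketch in Python. It is not the Ensembl Perl API: the classes, the function names direct_entries() and all_links(), and the placeholder accessions are all invented for illustration. It only models the behaviour described in this thread, where one call looks at the gene object alone and the other also walks the transcripts and translations.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# An xref is modelled as a (database name, accession) pair.
Xref = Tuple[str, str]


@dataclass
class Translation:
    xrefs: List[Xref] = field(default_factory=list)


@dataclass
class Transcript:
    xrefs: List[Xref] = field(default_factory=list)
    translation: Optional[Translation] = None


@dataclass
class Gene:
    xrefs: List[Xref] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)


def direct_entries(gene, db=None):
    """Hypothetical analogue of a gene-level-only lookup: one level, cheap."""
    return [x for x in gene.xrefs if db is None or x[0] == db]


def all_links(gene, db=None):
    """Hypothetical analogue of the catch-all: gene, transcripts, translations."""
    found = list(gene.xrefs)
    for transcript in gene.transcripts:
        found.extend(transcript.xrefs)
        if transcript.translation is not None:
            found.extend(transcript.translation.xrefs)
    return [x for x in found if db is None or x[0] == db]


# Made-up data: the EntrezGene xref sits on the translation, not on the gene.
gene = Gene(
    xrefs=[("HGNC", "hgnc-placeholder")],
    transcripts=[
        Transcript(
            translation=Translation(xrefs=[("EntrezGene", "entrez-placeholder")])
        )
    ],
)

print(direct_entries(gene, "EntrezGene"))  # misses the translation-level xref
print(all_links(gene, "EntrezGene"))       # finds it, at the cost of a full walk
```

In this model the gene-level lookup silently returns nothing when the xref lives on another level, which is the situation that pushes callers towards the slower catch-all.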

Something I have suggested before is metadata to store this information, so that a short lookup tells you, for a given species, which level each database is mapped onto. This is still quite unintuitive when first using the API, but it helps. It would then even be possible to have an API call that does this for you: $gene->get_all_Foo('EntrezGene'), using the metadata to work out what to provide. This would also make moot the question of whether the gene-protein mapping is useful information, because that information would still be in there.
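
A minimal sketch of that metadata idea, again as a toy Python model rather than Ensembl code. The XREF_LEVEL table, the species names, the dict layout of a gene, and the get_xrefs() function are all assumptions made up for this example; no such table or call is claimed to exist in the API.

```python
# Hypothetical metadata: (species, external database) -> annotation level.
# The entries are invented placeholders, not real Ensembl mappings.
XREF_LEVEL = {
    ("species_a", "EntrezGene"): "translation",
    ("species_b", "EntrezGene"): "gene",
}

ALL_LEVELS = ["gene", "transcript", "translation"]


def xrefs_at_level(gene, level):
    """Collect the (db, accession) pairs stored at one level of a gene."""
    if level == "gene":
        return list(gene["xrefs"])
    found = []
    for transcript in gene["transcripts"]:
        if level == "transcript":
            found.extend(transcript["xrefs"])
        elif level == "translation" and transcript.get("translation"):
            found.extend(transcript["translation"]["xrefs"])
    return found


def get_xrefs(gene, species, db):
    """Query only the level named in the metadata; fall back to every level."""
    level = XREF_LEVEL.get((species, db))
    levels = [level] if level else ALL_LEVELS
    return [x for lvl in levels for x in xrefs_at_level(gene, lvl) if x[0] == db]


# Made-up gene whose only EntrezGene xref is attached to a translation.
gene_a = {
    "xrefs": [],
    "transcripts": [
        {
            "xrefs": [],
            "translation": {"xrefs": [("EntrezGene", "entrez-placeholder")]},
        }
    ],
}

print(get_xrefs(gene_a, "species_a", "EntrezGene"))
```

The caller names only the species and the database; the lookup decides which objects to visit, and the full walk is kept as a fallback for combinations the metadata does not cover.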

On the subject of multiple mappings, this is a tough one. The fact that two proteins from the same gene can map to different EntrezGene records isn't really qualitatively useful by itself (you still don't know which ones are "correct"); rather, it tells you that there is some kind of discrepancy between Ensembl and Entrez. Most of the time, people just want to be told "the answer", and it's probably a good compromise to simply have 1 gene : many Entrez - people can understand that, even if they were only expecting one. Better still would be if Ensembl flagged which one is "primary" (inferred, for example, as "most proteins map to Entrez foo, one of them maps to Entrez bar").
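
One possible reading of that "primary" inference is a simple majority vote over the per-protein mappings. The sketch below assumes exactly that rule; the function name infer_primary() and the choice to report no primary on a tie are illustrative decisions, not anything Ensembl is known to do.

```python
from collections import Counter


def infer_primary(mappings):
    """Return the external ID most proteins map to, or None if there is no clear winner.

    mappings holds one external ID per mapped protein, so IDs may repeat.
    """
    counts = Counter(mappings)
    if not counts:
        return None
    ranked = counts.most_common(2)
    # A tie between the two most frequent IDs means no majority to report.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# "Most proteins map to Entrez foo, one of them maps to Entrez bar."
print(infer_primary(["foo", "foo", "foo", "bar"]))
```

All of the mappings would still be returned to the user; the flag only marks which one to treat as "the answer" when a single ID is wanted.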

>> get_all_DBEntries() is a special case which most users won't need, and
>> the name should indicate its nature as well (and maybe have a less
>> catchy name so that people don't use it by accident, as is the case
>> now).
> 
> See above.
> 
> Cheers,
> Andreas
> 
> 
>> sure, all this is documented in the POD, but the beauty of the Ensembl
>> core API is that object and method names are so intuitive that in 95% of
>> the use cases you don't have to read the documentation. so, ironing out
>> the few wrinkles would be cool.
>> 
>> cheers
>> 
>>    patrick
>> 
>> -- 
>> Patrick Meidl, Mag.
>> Bioinformatician
>> 
>> Ce-M-M-
>> Research Centre for Molecular Medicine
>> of the Austrian Academy of Science
>> 
>> Lazarettgasse 14 / AKH BT 25.3
>> Vienna, Austria
>> 
>> room 02.205
>> phone +43 1 40160 70016
>> email pmeidl at cemm.oeaw.ac.at
>> web http://www.cemm.at/
>> 
>> 
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
>> 
> 
> -- 
> Andreas Kähäri, Ensembl Software Developer
> European Bioinformatics Institute (EMBL-EBI)
> Wellcome Trust Genome Campus
> Hinxton, Cambridge CB10 1SD, United Kingdom
> 