[ensembl-dev] Hyphenated entrez xrefs?

Sun Dec 15 16:51:03 GMT 2013

Perfect. Thanks for the sample query and thanks for further clarifying the situation with those Entrez Gene IDs. Placing those in a separate external db would nicely resolve the issue. Much appreciated!
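
Until that change lands, the manufactured entries can be screened out by accession format. A minimal sketch, assuming genuine Entrez Gene IDs are always plain integers and that the accession strings come from each DBEntry (for example via `$dbe->primary_id()`); the helper name `is_entrez_gene_id` is made up for illustration and is not part of the Ensembl API:

```perl
use strict;
use warnings;

# Hypothetical helper (not an Ensembl API call): true only for accessions
# that consist of one or more digits and nothing else.
sub is_entrez_gene_id {
    my ($acc) = @_;
    return (defined $acc && $acc =~ /\A[0-9]+\z/) ? 1 : 0;
}

# Accessions copied from the two xref rows quoted later in this thread.
my @accessions = ('288264', '288264-201');

# Keep the genuine ID and drop the transcript-name entry with the -201 suffix.
my @genuine = grep { is_entrez_gene_id($_) } @accessions;

print join(', ', @genuine), "\n";    # prints 288264
```

The check is deliberately strict: anything containing a hyphen, letters or whitespace is rejected, so it would need loosening for external dbs whose accessions are not purely numeric.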

 - Alex

> On Dec 15, 2013, at 4:35 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> 
> Hi Alex,
> 
> Firstly, we do not have any documentation on which object level each external db is attached to. However, Ensembl attempts to attach cross references to their most relevant object, e.g. UniProt entries are linked to Ensembl Translations. If you want to see what is currently linked to each object type, you can use the following query:
> 
> select distinct db_name, ensembl_object_type
> from external_db
> join xref using (external_db_id)
> join object_xref using (xref_id)
> order by db_name;
> 
> It's not fast to run but it will bring back the information.
> 
> Secondly, it is not our intention to invent IDs like this and claim they are an external resource's IDs. These names are actually created to name the transcripts, providing a visual way to link transcripts to genes in our region displays. A few releases ago we decided to name transcripts in all species in the same way we do for human, mouse and zebrafish. This means taking the gene's name and adding an incrementing suffix. These manufactured names are normally linked to a different external db, e.g. HGNC vs. HGNC_transcript_name. It seems that this did not happen with EntrezGene, which is why this situation has occurred. We will rectify this and apologise for the confusion it has caused.
> 
> Hope this helps,
> 
> Andy
> 
> ------------
> Andrew Yates - Ensembl Support Coordinator
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> Tel: +44-(0)1223-492538
> Fax: +44-(0)1223-494468
> http://www.ensembl.org/
> 
>> On 14 Dec 2013, at 20:19, Alexander Pico <apico at gladstone.ucsf.edu> wrote:
>> 
>> Hi Magali,
>> 
>> Thanks for the clarification. Our script actually extracts xrefs for a couple dozen external dbs per $gene, including UniProt, so I think that's why we use DBLinks rather than DBEntries. Is there documentation on which external dbs have xrefs associated with genes vs transcripts and translations? Or do I just need to run both periodically and compare the results?
>> 
>> In general, I'd like to offer the feedback that making up identifiers, such as 288264-201, and calling them Entrez Gene database xrefs is poor form. They are no longer reliable or useful as identifiers. Entrez Gene does not recognize the ID, and it breaks downstream applications that expect a proper ID.
>> 
>> I haven't seen this with any other external databases in the Ensembl xref tables yet. Are there other cases of manufactured IDs I should look out for in the DBLinks system? Is this practice isolated to Entrez Gene so far?
>> 
>> Thanks!
>> - Alex
>> 
>>> On Dec 14, 2013, at 7:05 AM, mr6 at ebi.ac.uk wrote:
>>> 
>>> Hi Alex,
>>> 
>>> These hyphenated extensions are used for transcripts.
>>> If a gene is associated with a given EntrezGene entry, we can use this to
>>> assign a name to all transcripts of that gene.
>>> To be able to distinguish those transcripts, we number them by adding
>>> -201, -202, etc.
>>> This is based on the numbering system already used for manual annotation.
>>> 
>>> In your query, you are using the method get_all_DBLinks, which will return
>>> all xrefs associated with the gene, as well as all DBEntries that are
>>> associated with the transcripts and corresponding translations of this
>>> gene.
>>> To retrieve only the DBEntries associated with the gene, you can use the
>>> method get_all_DBEntries.
>>> 
>>> For both methods, get_all_DBLinks and get_all_DBEntries, you can add the
>>> external_db_name as an argument.
>>> $gene->get_all_DBEntries('EntrezGene') will return only EntrezGene xrefs
>>> for this gene.
>>> More information on the methods can be found here:
>>> http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html#a5aaf31a07a3d82c3841a411a0a55e81b
>>> 
>>> 
>>> Hope this helps,
>>> Magali
>>> 
>>> 
>>>> Dear Ensembl,
>>>> 
>>>> I've run across a number of examples of hyphenated Entrez Gene identifiers
>>>> in xref tables, starting back in release 72, for example:
>>>> 
>>>> rattus_norvegicus_core_72_5
>>>> 
>>>> +---------+----------------+---------------+---------------+---------+-----------------------------------------------+-----------+---------------+
>>>> | xref_id | external_db_id | dbprimary_acc | display_label | version | description                                   | info_type | info_text     |
>>>> +---------+----------------+---------------+---------------+---------+-----------------------------------------------+-----------+---------------+
>>>> |  576085 |           1300 | 288264        | Ifnar1        | 0       | interferon (alpha, beta and omega) receptor 1 | DEPENDENT |               |
>>>> | 1143738 |           1300 | 288264-201    | Ifnar1-201    | 0       | interferon (alpha, beta and omega) receptor 1 | MISC      | via gene name |
>>>> +---------+----------------+---------------+---------------+---------+-----------------------------------------------+-----------+---------------+
>>>> 
>>>> The first result is accurate, but the second one is apparently
>>>> manufactured. This entry breaks a number of downstream uses for xrefs,
>>>> since the "-201" is not part of the official ID format for Entrez Gene,
>>>> for example.
>>>> 
>>>> What are these? Are you planning on keeping these around in future xref
>>>> tables?
>>>> 
>>>> And how would you recommend avoiding these in xref queries using the Perl
>>>> API? Here's my current Perl pseudocode:
>>>> 
>>>> my $db_entries = $gene->get_all_DBLinks();
>>>> foreach my $dbe (@$db_entries) {
>>>>    if ($dbe->dbname() eq 'EntrezGene') {
>>>>        # Collect xref associated with $gene
>>>>    }
>>>> }
>>>> 
>>>> What other filters or checks should I do to exclude the manufactured
>>>> identifiers associated with your Entrez Gene records?
>>>> 
>>>> Thanks!
>>>> - Alex
>>>> 
>>>> ----------------------------------------
>>>> Alexander Pico, PhD
>>>> NRNB Executive Director
>>>> Bioinformatics Assoc. Director
>>>> Gladstone Institutes
>>>> http://nrnb.org
>>>> http://gladstoneinstitutes.org
>>>> ----------------------------------------
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/