[ensembl-dev] Hyphenated entrez xrefs?

Sat Dec 14 15:05:47 GMT 2013

Hi Alex,

These hyphenated extensions are used for transcripts.
If a gene is associated with a given EntrezGene entry, we use that entry to
assign a name to every transcript of that gene.
To distinguish those transcripts, we number them by appending
-201, -202, etc.
This follows the numbering system already used for manual annotation.
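
As a rough illustration of the naming scheme, here is a minimal sketch in
plain Perl that splits such a label into its gene-level part and its
transcript number. The helper name split_transcript_suffix is hypothetical
(it is not part of the Ensembl API), and the pattern assumes the suffix is
always a trailing hyphen followed by digits, as in the -201 and -202 examples
above. A gene name that naturally ends in a hyphen and digits would also
match, so treat this as a sketch rather than a rule.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API.
# Assumes the transcript suffix is a trailing hyphen followed by digits.
sub split_transcript_suffix {
    my ($label) = @_;
    if ($label =~ /^(.+)-(\d+)$/) {
        return ($1, $2);     # (gene-level part, transcript number)
    }
    return ($label, undef);  # no suffix: a plain gene-level identifier
}

my ($base, $number) = split_transcript_suffix('288264-201');
print "$base $number\n";     # prints "288264 201"
```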

In your query, you are using the method get_all_DBLinks, which returns
all xrefs associated with the gene, as well as all DBEntries
associated with the transcripts and corresponding translations of this
gene.
To retrieve only the DBEntries associated with the gene itself, you can use
the method get_all_DBEntries.

For both methods, get_all_DBLinks and get_all_DBEntries, you can pass the
external_db_name as an argument.
For example, $gene->get_all_DBEntries('EntrezGene') will return only
EntrezGene xrefs for this gene.
More information on the methods can be found here:
http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html#a5aaf31a07a3d82c3841a411a0a55e81b
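
Putting this together, a gene-level lookup might be sketched as follows. It
assumes $gene is a Bio::EnsEMBL::Gene object you have already fetched through
the API. The sub name entrez_gene_ids is hypothetical, and the digits-only
check is an extra precaution based on the assumption that Entrez Gene IDs are
plain integers; the API itself does not require it.

```perl
use strict;
use warnings;

# Hypothetical helper: collect Entrez Gene IDs attached to the gene itself.
# $gene is assumed to be a Bio::EnsEMBL::Gene fetched elsewhere.
sub entrez_gene_ids {
    my ($gene) = @_;
    my @ids;
    # Gene-level xrefs only, restricted to the EntrezGene external database.
    foreach my $dbe (@{ $gene->get_all_DBEntries('EntrezGene') }) {
        my $acc = $dbe->primary_id();
        # Defensive check: keep purely numeric accessions, which skips
        # anything shaped like "288264-201".
        push @ids, $acc if $acc =~ /^\d+$/;
    }
    return @ids;
}
```

If you need to keep using get_all_DBLinks for other reasons, the same numeric
check on primary_id would be what excludes the transcript-level entries.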

Hope this helps,
Magali

> Dear Ensembl,
>
> I've run across a number of examples of hyphenated Entrez Gene identifiers
> in xref tables, starting back in release 72, for example:
>
> rattus_norvegicus_core_72_5
>
> +---------+----------------+---------------+---------------+---------+-----------------------------------------------+-----------+---------------+
> | xref_id | external_db_id | dbprimary_acc | display_label | version | description                                   | info_type | info_text     |
> +---------+----------------+---------------+---------------+---------+-----------------------------------------------+-----------+---------------+
> |  576085 |           1300 | 288264        | Ifnar1        | 0       | interferon (alpha, beta and omega) receptor 1 | DEPENDENT |               |
> | 1143738 |           1300 | 288264-201    | Ifnar1-201    | 0       | interferon (alpha, beta and omega) receptor 1 | MISC      | via gene name |
> +---------+----------------+---------------+---------------+---------+-----------------------------------------------+-----------+---------------+
>
> The first result is accurate, but the second one is apparently
> manufactured. This entry breaks a number of downstream uses for xrefs,
> since, for example, the "-201" is not part of the official ID format for
> Entrez Gene.
>
> What are these? Are you planning on keeping these around in future xref
> tables?
>
> And how would you recommend avoiding these in xref queries using the Perl
> API? Here's my current Perl pseudocode:
>
> my $db_entries = $gene->get_all_DBLinks();
> foreach my $dbe (@$db_entries) {
> 	if ($dbe->dbname() eq 'EntrezGene') {
> 		# Collect xref associated with $gene
> 	}
> }
>
> What other filters or checks should I do to exclude the manufactured
> identifiers associated with your Entrez Gene records?
>
> Thanks!
> - Alex
>
> ----------------------------------------
> Alexander Pico, PhD
> NRNB Executive Director
> Bioinformatics Assoc. Director
> Gladstone Institutes
> http://nrnb.org
> http://gladstoneinstitutes.org
> ----------------------------------------