[ensembl-dev] getting the entrez gene id from an ensembl record

Oliver, Gavin gavin.oliver at almacgroup.com
Thu Dec 2 10:22:35 GMT 2010


Is it possible to have Entrez gene IDs at more than one level?

That is, a top-level Entrez ID linked to the gene, with the others below it at the transcript or translation level?

Most end users expect a single accurate Entrez gene ID that ties to a single HGNC ID. They don't want to deal with the underlying complexity. It would be good to be able to cater for this somehow.


-----Original Message-----
From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of ian Longden
Sent: 02 December 2010 10:05
To: Andy Jenkinson
Cc: dev
Subject: Re: [ensembl-dev] getting the entrez gene id from an ensembl record

The main reason some species have database sources mapped to
different levels (gene, transcript, translation) is that some species
get their external database sources from an import, while others are
mapped in house with the internal external database mapping code.

For human release 60, a closer look at the data shows:

select e.db_name, ox.ensembl_object_type, count(*)
from xref x, object_xref ox, external_db e
where x.external_db_id = e.external_db_id
  and x.xref_id = ox.xref_id
group by e.db_name, ox.ensembl_object_type;
+--------------------------------+---------------------+----------+
| db_name                        | ensembl_object_type | count(*) |
+--------------------------------+---------------------+----------+
| CCDS                           | Transcript          |    30348 |
| Clone_based_ensembl_gene       | Gene                |    11140 |
| Clone_based_ensembl_transcript | Transcript          |    11598 |
| Clone_based_vega_gene          | Gene                |    16539 |
| Clone_based_vega_transcript    | Transcript          |    25143 |
| DBASS3                         | Gene                |      156 |
| DBASS5                         | Gene                |      243 |
| EMBL                           | Translation         |   522829 |
| Ens_Hs_gene                    | Gene                |      115 |
| Ens_Hs_transcript              | Transcript          |      118 |
| Ens_Hs_translation             | Translation         |      122 |
| ENS_LRG_gene                   | Gene                |      115 |
| ENS_LRG_transcript             | Transcript          |      113 |
| EntrezGene                     | Translation         |    37791 |
| GO                             | Translation         |   393984 |
| goslim_goa                     | Translation         |   460376 |
| HGNC                           | Transcript          |    23032 |
| HGNC                           | Gene                |    23147 |
| HGNC_automatic_gene            | Gene                |     6903 |
| HGNC_automatic_transcript      | Transcript          |    31295 |
| HGNC_curated_gene              | Gene                |    17883 |
| HGNC_curated_transcript        | Transcript          |    91567 |
| HPA                            | Translation         |    52282 |
| IPI                            | Translation         |   133758 |
| LRG                            | Gene                |      230 |
| MEROPS                         | Translation         |     1321 |
| MIM_GENE                       | Translation         |    28150 |
| MIM_MORBID                     | Translation         |    10298 |
| miRBase                        | Transcript          |      715 |
| OTTG                           | Gene                |    35306 |
| OTTT                           | Transcript          |    92207 |
| PDB                            | Translation         |    30592 |
| protein_id                     | Translation         |   493584 |
| PUBMED                         | Gene                |      131 |
| RefSeq_dna                     | Transcript          |    34750 |
| RefSeq_dna_predicted           | Transcript          |     7402 |
| RefSeq_genomic                 | Gene                |      427 |
| RefSeq_peptide                 | Translation         |    49603 |
| RefSeq_peptide_predicted       | Translation         |     4231 |
| RFAM                           | Transcript          |     5199 |
| shares_CDS_and_UTR_with_OTTT   | Transcript          |    22205 |
| shares_CDS_with_ENST           | Transcript          |      686 |
| shares_CDS_with_OTTT           | Transcript          |     1153 |
| UCSC                           | Transcript          |    69330 |
| UniGene                        | Transcript          |    44982 |
| Uniprot/SPTREMBL               | Translation         |   180414 |
| Uniprot/SWISSPROT              | Translation         |    27436 |
| Vega_gene                      | Gene                |      860 |
| Vega_transcript                | Transcript          |   229832 |
| Vega_translation               | Translation         |   105267 |
| WikiGene                       | Translation         |    37791 |
+--------------------------------+---------------------+----------+

So here we can see that the Entrez genes are mapped to the translations.

We could move this up onto the gene, but the problem is that if a
gene has alternative splicing then its different transcripts and
translations may have different Entrez gene identifiers, and we would
therefore lose information which might be useful.
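
To make this concern concrete, here is a minimal Python sketch using made-up data. The gene and translation names and the EntrezGene values are hypothetical placeholders, not real records. It rolls translation-level xrefs up to the gene and reports which genes end up with more than one EntrezGene ID, i.e. where keeping a single gene-level ID would discard something.

```python
from collections import defaultdict

# Hypothetical toy data: (gene, translation, EntrezGene ID).
# None of these identifiers are real records.
translation_xrefs = [
    ("GENE_A", "PROT_A1", "1001"),
    ("GENE_A", "PROT_A2", "1002"),  # alternative splice form, different ID
    ("GENE_B", "PROT_B1", "2001"),
    ("GENE_B", "PROT_B2", "2001"),  # both forms agree on one ID
]


def roll_up_to_gene(rows):
    """Collect the distinct EntrezGene IDs seen on each gene's translations."""
    by_gene = defaultdict(set)
    for gene, _translation, entrez_id in rows:
        by_gene[gene].add(entrez_id)
    return {gene: sorted(ids) for gene, ids in by_gene.items()}


def ambiguous_genes(rows):
    """Genes for which a single gene-level ID would drop information."""
    rolled = roll_up_to_gene(rows)
    return sorted(gene for gene, ids in rolled.items() if len(ids) > 1)


if __name__ == "__main__":
    print(roll_up_to_gene(translation_xrefs))
    print(ambiguous_genes(translation_xrefs))
```

In this toy data only GENE_A is ambiguous; GENE_B could be given a single gene-level ID without loss.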

We could have the identifier on both the gene and the translation, but
BioMart does not like having the same source on two levels (in this
case gene and translation). We have made exceptions for this (MGI and
HGNC), but it involves a lot of work.

get_all_DBLinks was designed so that people do not have to know which
level each source is on.
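
The difference between the two lookups can be modelled with a small Python sketch. This is a toy model and not the Ensembl Perl API: the classes, the method names `direct_xrefs` and `all_xrefs`, and the identifiers are all hypothetical. `direct_xrefs` stands in for a gene-only lookup, and `all_xrefs` for the behaviour described in this thread, where the transcripts and translations of the gene are searched too.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Xref = Tuple[str, str]  # (db_name, identifier)


@dataclass
class Translation:
    xrefs: List[Xref] = field(default_factory=list)


@dataclass
class Transcript:
    xrefs: List[Xref] = field(default_factory=list)
    translation: Optional[Translation] = None


@dataclass
class Gene:
    xrefs: List[Xref] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)

    def direct_xrefs(self) -> List[Xref]:
        """Only the xrefs attached to the gene itself."""
        return list(self.xrefs)

    def all_xrefs(self) -> List[Xref]:
        """Xrefs on the gene plus its transcripts and their translations."""
        found = list(self.xrefs)
        for transcript in self.transcripts:
            found.extend(transcript.xrefs)
            if transcript.translation is not None:
                found.extend(transcript.translation.xrefs)
        return found


def ids_for(xrefs: List[Xref], db_name: str) -> List[str]:
    """Distinct identifiers from one source, sorted for stable output."""
    return sorted({ident for name, ident in xrefs if name == db_name})


# Hypothetical gene: HGNC on the gene, EntrezGene on the translation.
gene = Gene(
    xrefs=[("HGNC", "HGNC:0000")],
    transcripts=[
        Transcript(
            xrefs=[("RefSeq_dna", "NM_000000")],
            translation=Translation(xrefs=[("EntrezGene", "1001")]),
        )
    ],
)

if __name__ == "__main__":
    print(ids_for(gene.direct_xrefs(), "EntrezGene"))
    print(ids_for(gene.all_xrefs(), "EntrezGene"))
```

With the source mapped at the translation level, the gene-only lookup finds no EntrezGene entry, while the full traversal finds it.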

If you want to use SQL to get this data out, then the above SQL will
tell you which level a source is mapped to.
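
As an illustration, the same join can be narrowed to a single source. The Python sketch below runs it against an in-memory SQLite database. The three tables contain only the columns used in the query above (the real core schema has more), the rows are invented, and the helper names `build_toy_db` and `levels_for` are hypothetical.

```python
import sqlite3

# Cut-down tables holding only the columns the query joins on.
SCHEMA = """
CREATE TABLE external_db (external_db_id INTEGER PRIMARY KEY, db_name TEXT);
CREATE TABLE xref (xref_id INTEGER PRIMARY KEY, external_db_id INTEGER);
CREATE TABLE object_xref (xref_id INTEGER, ensembl_object_type TEXT);
"""

# Same join as the query above, restricted to one source name.
LEVEL_QUERY = """
SELECT e.db_name, ox.ensembl_object_type, COUNT(*)
FROM xref x, object_xref ox, external_db e
WHERE x.external_db_id = e.external_db_id
  AND x.xref_id = ox.xref_id
  AND e.db_name = ?
GROUP BY e.db_name, ox.ensembl_object_type
ORDER BY ox.ensembl_object_type
"""


def build_toy_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO external_db VALUES (?, ?)",
        [(1, "EntrezGene"), (2, "HGNC")],
    )
    conn.executemany(
        "INSERT INTO xref VALUES (?, ?)",
        [(10, 1), (11, 1), (20, 2)],
    )
    conn.executemany(
        "INSERT INTO object_xref VALUES (?, ?)",
        [(10, "Translation"), (11, "Translation"),
         (20, "Gene"), (20, "Transcript")],
    )
    conn.commit()
    return conn


def levels_for(conn, db_name):
    """Return (db_name, object_type, count) rows for one source."""
    return conn.execute(LEVEL_QUERY, (db_name,)).fetchall()


if __name__ == "__main__":
    conn = build_toy_db()
    for source in ("EntrezGene", "HGNC"):
        print(levels_for(conn, source))
```

A source that returns more than one row, like HGNC in the toy data, is mapped at more than one level.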

We have no problem moving EntrezGene or other sources to different
levels. Our worry was that we would lose data that is important, but
if the demand is for this to be on genes then it can be done.


-Ian Longden
Ensembl Developer.

On Thu, Dec 2, 2010 at 9:36 AM, Andy Jenkinson <andy.jenkinson at ebi.ac.uk> wrote:
> On 2 Dec 2010, at 07:51, Patrick Meidl wrote:
>
>> On Thu, Dec 02 2010, ian Longden <ianl at ebi.ac.uk> wrote:
>>
>>> Use $gene->get_all_DBLinks, as this gets the external database
>>> references on the transcripts and translations of the gene too.
>>>
>>> get_all_DBEntries only gets the ones attached to the gene directly.
>>
>> I'm not sure if there are any stats, but this misunderstanding must rank
>> in the top 10 of the most frequently asked questions about the Ensembl
>> core API.
>>
>> in combination with the fact that these methods are also among the very
>> few where there is no naming consistency between the database tables and
>> the model object names (xref vs DBEntry), it might be a good idea to
>> think about more expressive names for the methods (and possibly the
>> models as well). the old names could be deprecated but kept as
>> aliases/proxies for backward compatibility.
>
> Also, whether something is attached to a gene or transcript is not always intuitive: as I understand it, it depends on how the mapping was done, not the logical data model relationships. You have to know the internal data model before you can use it. For example, why is an EntrezGene not a gene-related record?
>
>> just my 2c...
>
> And mine :)
>
>>
>>    patrick
>>
>> --
>> Patrick Meidl, Mag.
>> Bioinformatician
>>
>> Ce-M-M-
>> Research Centre for Molecular Medicine
>> of the Austrian Academy of Science
>>
>> Lazarettgasse 14 / AKH BT 25.3
>> Vienna, Austria
>>
>> room 02.205
>> phone +43 1 40160 70016
>> email pmeidl at cemm.oeaw.ac.at
>> web http://www.cemm.at/
>>
>>
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
