[ensembl-dev] getting the entrez gene id from an ensembl record

ian Longden ianl at ebi.ac.uk
Thu Dec 2 10:04:34 GMT 2010


The main reason species differ in the level (gene, transcript,
translation) to which some database sources are mapped is that some
species get their external database sources from an import, while
others are mapped with our in-house external database mapping code.

For human release 60 if we take a closer look at the data we see :-

select e.db_name, ox.ensembl_object_type, count(*)
from xref x, object_xref ox, external_db e
where x.external_db_id = e.external_db_id
  and x.xref_id = ox.xref_id
group by e.db_name, ox.ensembl_object_type;
+--------------------------------+---------------------+----------+
| db_name                        | ensembl_object_type | count(*) |
+--------------------------------+---------------------+----------+
| CCDS                           | Transcript          |    30348 |
| Clone_based_ensembl_gene       | Gene                |    11140 |
| Clone_based_ensembl_transcript | Transcript          |    11598 |
| Clone_based_vega_gene          | Gene                |    16539 |
| Clone_based_vega_transcript    | Transcript          |    25143 |
| DBASS3                         | Gene                |      156 |
| DBASS5                         | Gene                |      243 |
| EMBL                           | Translation         |   522829 |
| Ens_Hs_gene                    | Gene                |      115 |
| Ens_Hs_transcript              | Transcript          |      118 |
| Ens_Hs_translation             | Translation         |      122 |
| ENS_LRG_gene                   | Gene                |      115 |
| ENS_LRG_transcript             | Transcript          |      113 |
| EntrezGene                     | Translation         |    37791 |
| GO                             | Translation         |   393984 |
| goslim_goa                     | Translation         |   460376 |
| HGNC                           | Transcript          |    23032 |
| HGNC                           | Gene                |    23147 |
| HGNC_automatic_gene            | Gene                |     6903 |
| HGNC_automatic_transcript      | Transcript          |    31295 |
| HGNC_curated_gene              | Gene                |    17883 |
| HGNC_curated_transcript        | Transcript          |    91567 |
| HPA                            | Translation         |    52282 |
| IPI                            | Translation         |   133758 |
| LRG                            | Gene                |      230 |
| MEROPS                         | Translation         |     1321 |
| MIM_GENE                       | Translation         |    28150 |
| MIM_MORBID                     | Translation         |    10298 |
| miRBase                        | Transcript          |      715 |
| OTTG                           | Gene                |    35306 |
| OTTT                           | Transcript          |    92207 |
| PDB                            | Translation         |    30592 |
| protein_id                     | Translation         |   493584 |
| PUBMED                         | Gene                |      131 |
| RefSeq_dna                     | Transcript          |    34750 |
| RefSeq_dna_predicted           | Transcript          |     7402 |
| RefSeq_genomic                 | Gene                |      427 |
| RefSeq_peptide                 | Translation         |    49603 |
| RefSeq_peptide_predicted       | Translation         |     4231 |
| RFAM                           | Transcript          |     5199 |
| shares_CDS_and_UTR_with_OTTT   | Transcript          |    22205 |
| shares_CDS_with_ENST           | Transcript          |      686 |
| shares_CDS_with_OTTT           | Transcript          |     1153 |
| UCSC                           | Transcript          |    69330 |
| UniGene                        | Transcript          |    44982 |
| Uniprot/SPTREMBL               | Translation         |   180414 |
| Uniprot/SWISSPROT              | Translation         |    27436 |
| Vega_gene                      | Gene                |      860 |
| Vega_transcript                | Transcript          |   229832 |
| Vega_translation               | Translation         |   105267 |
| WikiGene                       | Translation         |    37791 |
+--------------------------------+---------------------+----------+

So here we can see that the Entrez genes are mapped to the translations.

We could move this up onto the gene, but the problem is that if a
gene has alternative splicing then the alternative transcripts may
have different Entrez gene identifiers, and we would therefore lose
information which might be useful.

We could have the identifier on both the gene and the translation, but
BioMart does not like having the same source on two levels (i.e. in
this case gene and translation). We have made exceptions for this (MGI
and HGNC), but this involves a lot of work.

get_all_DBLinks was designed so that people do not have to know
which level each source is on.
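
To make the difference concrete, here is a minimal conceptual sketch in
Python. It is not the Ensembl API: the classes, the function names
direct_xrefs and all_linked_xrefs, and the accessions are all
hypothetical. It only models the idea that one call looks at the gene
alone while the other also walks the gene's transcripts and
translations.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (source name, accession); every accession used below is made up.
Xref = Tuple[str, str]


@dataclass
class Translation:
    xrefs: List[Xref] = field(default_factory=list)


@dataclass
class Transcript:
    xrefs: List[Xref] = field(default_factory=list)
    translation: Optional[Translation] = None


@dataclass
class Gene:
    xrefs: List[Xref] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)


def direct_xrefs(gene: Gene, source: Optional[str] = None) -> List[Xref]:
    """Xrefs attached to the gene itself only (hypothetical helper)."""
    return [x for x in gene.xrefs if source is None or x[0] == source]


def all_linked_xrefs(gene: Gene, source: Optional[str] = None) -> List[Xref]:
    """Xrefs on the gene plus its transcripts and their translations."""
    found = list(gene.xrefs)
    for transcript in gene.transcripts:
        found.extend(transcript.xrefs)
        if transcript.translation is not None:
            found.extend(transcript.translation.xrefs)
    return [x for x in found if source is None or x[0] == source]


# Hypothetical gene: HGNC on the gene, EntrezGene on each translation.
gene = Gene(
    xrefs=[("HGNC", "HGNC:0000")],
    transcripts=[
        Transcript(translation=Translation(xrefs=[("EntrezGene", "1111")])),
        Transcript(translation=Translation(xrefs=[("EntrezGene", "2222")])),
    ],
)

print(direct_xrefs(gene, "EntrezGene"))      # nothing at gene level
print(all_linked_xrefs(gene, "EntrezGene"))  # both translation-level entries
```

In this toy model the two translations carry different EntrezGene
entries, which is the case where collapsing everything onto the gene
would hide which transcript each identifier came from.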

If you want to use SQL to get this data out, then the above SQL will
tell you which level a source is mapped to.
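
As a rough illustration of that SQL route, the sketch below builds a
tiny in-memory SQLite database in Python and runs two queries against
it: one to find the level a source is mapped to, and one to walk from a
gene through its transcripts and translations to the EntrezGene
accessions. The rows are invented, and the tables and columns beyond
those in the query above (transcript, translation, ensembl_id,
dbprimary_acc) are assumptions about the core schema, so check them
against the schema of the release you use.

```python
import sqlite3

# Toy stand-in for a few core tables; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE external_db (external_db_id INTEGER PRIMARY KEY, db_name TEXT);
CREATE TABLE xref (xref_id INTEGER PRIMARY KEY, external_db_id INTEGER,
                   dbprimary_acc TEXT);
CREATE TABLE object_xref (xref_id INTEGER, ensembl_id INTEGER,
                          ensembl_object_type TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
CREATE TABLE translation (translation_id INTEGER PRIMARY KEY,
                          transcript_id INTEGER);

INSERT INTO external_db VALUES (1, 'EntrezGene'), (2, 'HGNC');
INSERT INTO xref VALUES (10, 1, '1111'), (11, 1, '2222'), (12, 2, 'HGNC:0000');
INSERT INTO transcript VALUES (100, 1), (101, 1);
INSERT INTO translation VALUES (200, 100), (201, 101);
INSERT INTO object_xref VALUES
    (10, 200, 'Translation'),
    (11, 201, 'Translation'),
    (12, 1, 'Gene');
""")


def levels_for_source(db_name):
    """Return the object types (levels) that a source is mapped to."""
    rows = conn.execute(
        """
        SELECT DISTINCT ox.ensembl_object_type
        FROM xref x
        JOIN object_xref ox ON x.xref_id = ox.xref_id
        JOIN external_db e ON x.external_db_id = e.external_db_id
        WHERE e.db_name = ?
        ORDER BY ox.ensembl_object_type
        """,
        (db_name,),
    )
    return [r[0] for r in rows]


def translation_accessions_for_gene(gene_id, db_name):
    """Collect accessions of a translation-level source for one gene."""
    rows = conn.execute(
        """
        SELECT DISTINCT x.dbprimary_acc
        FROM transcript t
        JOIN translation tl ON tl.transcript_id = t.transcript_id
        JOIN object_xref ox ON ox.ensembl_id = tl.translation_id
                           AND ox.ensembl_object_type = 'Translation'
        JOIN xref x ON x.xref_id = ox.xref_id
        JOIN external_db e ON e.external_db_id = x.external_db_id
        WHERE t.gene_id = ? AND e.db_name = ?
        ORDER BY x.dbprimary_acc
        """,
        (gene_id, db_name),
    )
    return [r[0] for r in rows]


print(levels_for_source("EntrezGene"))
print(translation_accessions_for_gene(1, "EntrezGene"))
```

For a source mapped at the transcript or gene level, the join chain
would stop earlier, which is why checking the level first is useful.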

We have no problem moving Entrez gene or other sources on to different
levels. We were worried that we would lose data that is important, but
if the demand is for this to be on genes then this can be done.


-Ian Longden
Ensembl Developer.

On Thu, Dec 2, 2010 at 9:36 AM, Andy Jenkinson <andy.jenkinson at ebi.ac.uk> wrote:
> On 2 Dec 2010, at 07:51, Patrick Meidl wrote:
>
>> On Thu, Dec 02 2010, ian Longden <ianl at ebi.ac.uk> wrote:
>>
>>> Use $gene->get_all_DBLinks as this gets the external database
>>> references on the transcripts and translations of the gene too.
>>>
>>> DBEntries only gets the ones attached to the gene directly.
>>
>> I'm not sure if there are any stats, but this misunderstanding must rank
>> in the top 10 of the most frequently asked questions about the Ensembl
>> core API.
>>
>> In combination with the fact that these methods are also among the very
>> few where there is no naming consistency between the database tables and
>> the model object names (xref vs DBEntry), it might be a good idea to
>> think about more expressive names for the methods (and possibly the
>> models as well). The old names could be deprecated but kept as
>> aliases/proxies for backward compatibility.
>
> Also, whether something is attached to a gene or transcript is not always intuitive: as I understand it, it depends on how the mapping was done, not the logical data model relationships. You have to know the internal data model before you can use it. For example, why is an EntrezGene not a gene-related record?
>
>> just my 2c...
>
> And mine :)
>
>>
>>    patrick
>>
>> --
>> Patrick Meidl, Mag.
>> Bioinformatician
>>
>> Ce-M-M-
>> Research Centre for Molecular Medicine
>> of the Austrian Academy of Science
>>
>> Lazarettgasse 14 / AKH BT 25.3
>> Vienna, Austria
>>
>> room 02.205
>> phone +43 1 40160 70016
>> email pmeidl at cemm.oeaw.ac.at
>> web http://www.cemm.at/
>>
>>
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
>
