[ensembl-dev] Bug?? Error Mapping EnsemblID to entrez id

Ragavendran, Ashok ARAGAVENDRAN at mgh.harvard.edu
Fri Sep 4 14:29:10 BST 2015


Hi Magali,
    Thanks for your prompt response. I understand that there isn't 100% concordance across databases and that there is bound to be some level of incongruence. However, what concerns me is the inconsistency within the Ensembl databases themselves. Perhaps I am not doing something right, and I would be grateful for any suggestions on how to change my approach.
     To clarify:
         1) Using the Ensembl Gene ID as a key when I query BioMart, I get 4 Entrez IDs, as seen below in my original email.

         2) However, if I look the gene up using the search tool on Ensembl, I get only 1 Entrez ID, in this case ENSG00000196873 = 445571:
                http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000196873;r=9:68232003-68300015
         3) If I look up Entrez ID 55871 on NCBI (http://www.ncbi.nlm.nih.gov/gene/55871), I get the following:
                Official Symbol: CBWD1 (provided by HGNC)
                Official Full Name: COBW domain containing 1 (provided by HGNC)
                Primary source: HGNC:HGNC:17134
                See related: Ensembl:ENSG00000172785; HPRD:16686; MIM:611078; Vega:OTTHUMG00000019425
                Gene type: protein coding
                RefSeq status: VALIDATED
                Organism: Homo sapiens
                Lineage: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo
                Also known as: COBP
            This corresponds to ENSG00000172785, and if I then look up that Ensembl ID I get ENSG00000172785 = 55871, again only 1 Entrez ID:
                http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000172785;r=9:121038-179147

         4) Finally, if I query BioMart using the Entrez IDs, I get the following table:

Ensembl Gene ID    Associated Gene Name    EntrezGene ID
ENSG00000136682    CBWD2                   150472
ENSG00000147996    CBWD5                   220869
ENSG00000147996    CBWD5                   55871
ENSG00000172785    CBWD1                   150472
ENSG00000172785    CBWD1                   55871
ENSG00000196873    CBWD3                   150472
ENSG00000196873    CBWD3                   220869
ENSG00000196873    CBWD3                   445571
ENSG00000196873    CBWD3                   55871
ENSG00000215126    CBWD7                   150472
ENSG00000215126    CBWD7                   220869

Here we can see that ENSG00000196873 is associated with all 4 Entrez IDs.
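
The many-to-many shape of the table above is easier to see when the rows are grouped by gene. A minimal Python sketch, using only the rows listed in the table; the variable names are illustrative and are not part of any Ensembl or BioMart API:

```python
from collections import defaultdict

# (Ensembl Gene ID, Associated Gene Name, EntrezGene ID) rows copied from the table above.
ROWS = [
    ("ENSG00000136682", "CBWD2", "150472"),
    ("ENSG00000147996", "CBWD5", "220869"),
    ("ENSG00000147996", "CBWD5", "55871"),
    ("ENSG00000172785", "CBWD1", "150472"),
    ("ENSG00000172785", "CBWD1", "55871"),
    ("ENSG00000196873", "CBWD3", "150472"),
    ("ENSG00000196873", "CBWD3", "220869"),
    ("ENSG00000196873", "CBWD3", "445571"),
    ("ENSG00000196873", "CBWD3", "55871"),
    ("ENSG00000215126", "CBWD7", "150472"),
    ("ENSG00000215126", "CBWD7", "220869"),
]

# Collect the distinct EntrezGene IDs reported for each Ensembl gene.
entrez_by_gene = defaultdict(set)
for ensembl_id, _name, entrez_id in ROWS:
    entrez_by_gene[ensembl_id].add(entrez_id)

# Print one line per gene: the gene ID, how many EntrezGene IDs it has, and the IDs.
for ensembl_id in sorted(entrez_by_gene):
    ids = sorted(entrez_by_gene[ensembl_id], key=int)
    print(ensembl_id, len(ids), ",".join(ids))
```

On these rows, ENSG00000196873 is the only gene with four EntrezGene IDs; ENSG00000136682 has one and the other three genes have two each.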

This brings me to two questions:
    1) Is the solution to take the union of a bidirectional query? That is, first query with the Ensembl Gene IDs, then use the resulting EntrezGene IDs to query for Ensembl Gene IDs, and take the union of the results?
    2) The confusion arose partly from the web search result (see the URL below), which shows only one Entrez ID associated with the gene, in agreement with the NCBI hyperlink. Would it be possible to update the web page to be consistent with the BioMart results? This is concerning because, for a single gene, we usually search using the web interface, and this could lead to erroneous conclusions.
            http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000196873;r=9:68232003-68300015
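
To make the round trip in question 1 concrete, here is a rough Python sketch. It assumes the pairs in the BioMart table from this message can stand in for both query directions; in practice each direction would be a separate BioMart query. The function names are hypothetical helpers, not BioMart calls.

```python
# (Ensembl Gene ID, EntrezGene ID) pairs copied from the BioMart table in this message.
# Assumption: the same pairs are used for both lookup directions.
PAIRS = [
    ("ENSG00000136682", "150472"),
    ("ENSG00000147996", "220869"),
    ("ENSG00000147996", "55871"),
    ("ENSG00000172785", "150472"),
    ("ENSG00000172785", "55871"),
    ("ENSG00000196873", "150472"),
    ("ENSG00000196873", "220869"),
    ("ENSG00000196873", "445571"),
    ("ENSG00000196873", "55871"),
    ("ENSG00000215126", "150472"),
    ("ENSG00000215126", "220869"),
]


def entrez_for(ensembl_ids):
    """Forward lookup: EntrezGene IDs paired with any of the given Ensembl genes."""
    return {entrez for gene, entrez in PAIRS if gene in ensembl_ids}


def ensembl_for(entrez_ids):
    """Reverse lookup: Ensembl genes paired with any of the given EntrezGene IDs."""
    return {gene for gene, entrez in PAIRS if entrez in entrez_ids}


def round_trip(ensembl_id):
    """Query forward, then backward, and return both result sets."""
    entrez_ids = entrez_for({ensembl_id})
    ensembl_ids = ensembl_for(entrez_ids) | {ensembl_id}
    return entrez_ids, ensembl_ids


entrez_ids, ensembl_ids = round_trip("ENSG00000196873")
print(sorted(entrez_ids, key=int))
print(sorted(ensembl_ids))
```

With these pairs, the round trip starting from ENSG00000196873 returns four EntrezGene IDs and all five Ensembl genes in the table, so the union widens the result set rather than narrowing it to a single gene.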

Once again thanks a lot for all your help in this regard.

Cheers
    Ashok


On 9/4/15 4:59 AM, dev-request at ensembl.org wrote:

Message: 2
Date: Fri, 04 Sep 2015 09:58:23 +0100
From: mag <mr6 at ebi.ac.uk>
Subject: Re: [ensembl-dev] Bug?? Error Mapping EnsemblID to entrez id
To: Dev <dev at ensembl.org>
Message-ID: <55E95D2F.3060505 at ebi.ac.uk>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"

Hi Ashok,

Mapping between resources is a complicated process which unfortunately
exposes some edge cases like this one.

To map Ensembl genes to EntrezGene ids, there is no direct mapping
available, hence we map via their respective transcripts, Ensembl
transcripts and RefSeq mRNAs.
Where the data is available, we attempt to map based on genomic
coordinates, but when everything else fails, the sequences are aligned.
Only the best hit is kept, but we do allow for mismatches as we know
models can vary between Ensembl and RefSeq, in particular regarding UTR
regions.
In this particular example, the Ensembl transcript ENST00000618217
aligns very well against 3 separate RefSeq sequences, corresponding to
CBWD1, CBWD2 and CBWD3:
http://e81.ensembl.org/Homo_sapiens/Transcript/Similarity?db=core;g=ENSG00000196873;r=9:68232003-68300015;t=ENST00000618217
Another transcript, ENST00000377342, aligns against 2 different RefSeq
sequences, corresponding to CBWD3 and CBWD5:
http://e81.ensembl.org/Homo_sapiens/Transcript/Similarity?db=core;g=ENSG00000196873;r=9:68232003-68300015;t=ENST00000377342

As a result, we have not one but 4 different EntrezGene ids for the same
Ensembl gene.
Note that all these RefSeq sequences are predicted sequences, as noted
by the XM_ prefix.
This means that we would never use any of those EntrezGene ids to name
the gene.
However, we still provide the initial mappings as these are our best
guess as to which RefSeq transcript corresponds to which Ensembl transcript.
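
The roll-up from transcript-level hits to gene-level EntrezGene ids can be sketched in Python as follows. This is only an illustration of the description above, not the actual Ensembl mapping pipeline. The transcript hits are the ones named in this message. The pairing of gene symbols with EntrezGene ids is an assumption inferred from the ids quoted elsewhere in this thread.

```python
# Both transcripts belong to the same Ensembl gene, as described above.
GENE_OF_TRANSCRIPT = {
    "ENST00000618217": "ENSG00000196873",
    "ENST00000377342": "ENSG00000196873",
}

# Gene symbols of the RefSeq sequences each transcript aligns against, as described above.
TRANSCRIPT_HITS = {
    "ENST00000618217": ["CBWD1", "CBWD2", "CBWD3"],
    "ENST00000377342": ["CBWD3", "CBWD5"],
}

# Assumption: symbol-to-EntrezGene pairing inferred from the ids quoted in this thread.
ENTREZ_BY_SYMBOL = {
    "CBWD1": "55871",
    "CBWD2": "150472",
    "CBWD3": "445571",
    "CBWD5": "220869",
}


def gene_level_entrez(gene_id):
    """Union of the EntrezGene ids reached through any transcript of the gene."""
    ids = set()
    for transcript, symbols in TRANSCRIPT_HITS.items():
        if GENE_OF_TRANSCRIPT[transcript] == gene_id:
            ids.update(ENTREZ_BY_SYMBOL[symbol] for symbol in symbols)
    return ids


print(sorted(gene_level_entrez("ENSG00000196873"), key=int))
```

Neither transcript alone reaches all four ids; the gene ends up with four only because the two sets of hits are combined at the gene level.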

We are hoping to improve these mappings by including genomic coordinate
information for predicted models, as is already done for the
curated RefSeq models (NM_-like identifiers).
This is unlikely to be available before the end of the year, though.

For correct gene naming, we recommend using HGNC identifiers, as these
are obtained via curated direct mappings from HGNC, who update them
regularly.


Hope this helps,
Magali

On 03/09/2015 20:04, Ragavendran, Ashok wrote:


> Hello,
>     I came upon this while using the BioMart interface. There are
> errors mapping Ensembl ID to EntrezGene ID. The Ensembl ID maps to the
> wrong Entrez ID; when I click the Entrez link, it shows a different
> Ensembl ID. Attached is a screenshot of the results. The Ensembl ID
> refers to CBWD3, but the EntrezGene IDs are for CBWD1, CBWD2, CBWD5 and
> CBWD3. The last result is the correct one. All the others are wrong, and
> they actually have different Ensembl IDs, which is what I wanted to
> retrieve.
>
>     Is there something I am missing??
>
> Cheers
>     Ashok
> ====== Text based Results from querying the gene id ENSG00000196873
> =======
> Ensembl Gene ID    EntrezGene ID
> ENSG00000196873    55871
> ENSG00000196873    150472
> ENSG00000196873    220869
> ENSG00000196873    445571
>
>
> ===== Screenshot of results: May not come through ===
>
>
>
> --
> Ashok Ragavendran
> Bioinformatics Specialist
> Center for Human Genetic Research
> Massachusetts General Hospital
> Richard B. Simches Research Center
> 185 Cambridge St, Boston MA 02114
> aragavendran at mgh.harvard.edu
> ph: +1-617-726-1329
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/



--
Ashok Ragavendran
Bioinformatics Specialist
Center for Human Genetic Research
Massachusetts General Hospital
Richard B. Simches Research Center
185 Cambridge St, Boston MA 02114
aragavendran at mgh.harvard.edu
ph: +1-617-726-1329