[ensembl-dev] Bug?? Error Mapping EnsemblID to entrez id

mag mr6 at ebi.ac.uk
Fri Sep 4 15:35:07 BST 2015


Hi Ashok,

On the gene page, we display the best RefSeq match we have, based on 
coordinate overlap:
http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000196873;r=9:68232003-68300015
This is what returns the EntrezGene id 55871.
You can see the data used on the following page:
http://useast.ensembl.org/Homo_sapiens/Share/209c84cdf8bea2ba5fb9e41097f883db3066398
which shows the overlap between RefSeq and Ensembl transcripts.
This mapping is only done for curated RefSeq transcripts, with NM_ 
identifiers.


The full list of mappings, including EntrezGene mappings obtained 
through alignments to predicted RefSeq transcripts, can be seen on the 
external references page
http://useast.ensembl.org/Homo_sapiens/Gene/Matches?db=core;g=ENSG00000196873;r=9:68232003-68300015;t=ENST00000618217
and this is what Biomart returns.
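
To make the many-to-many nature of these mappings concrete, here is a 
minimal Python sketch that groups the BioMart rows in both directions. 
The rows are copied from the BioMart table quoted further down in this 
thread; the function name group_mappings is purely illustrative and not 
part of any Ensembl or BioMart API.

```python
from collections import defaultdict

# Rows copied from the BioMart table quoted later in this thread:
# (Ensembl gene ID, associated gene name, EntrezGene ID)
ROWS = [
    ("ENSG00000136682", "CBWD2", "150472"),
    ("ENSG00000147996", "CBWD5", "220869"),
    ("ENSG00000147996", "CBWD5", "55871"),
    ("ENSG00000172785", "CBWD1", "150472"),
    ("ENSG00000172785", "CBWD1", "55871"),
    ("ENSG00000196873", "CBWD3", "150472"),
    ("ENSG00000196873", "CBWD3", "220869"),
    ("ENSG00000196873", "CBWD3", "445571"),
    ("ENSG00000196873", "CBWD3", "55871"),
    ("ENSG00000215126", "CBWD7", "150472"),
    ("ENSG00000215126", "CBWD7", "220869"),
]


def group_mappings(rows):
    """Index the rows by Ensembl gene ID and by EntrezGene ID."""
    gene_to_entrez = defaultdict(set)
    entrez_to_gene = defaultdict(set)
    for gene_id, _name, entrez_id in rows:
        gene_to_entrez[gene_id].add(entrez_id)
        entrez_to_gene[entrez_id].add(gene_id)
    return gene_to_entrez, entrez_to_gene


gene_to_entrez, entrez_to_gene = group_mappings(ROWS)

# One Ensembl gene can carry several EntrezGene IDs ...
print(sorted(gene_to_entrez["ENSG00000196873"]))
# ... and one EntrezGene ID can point back at several Ensembl genes.
print(sorted(entrez_to_gene["55871"]))
```

Grouping both ways is essentially the bidirectional query discussed in 
this thread, done on an already downloaded table rather than against 
the live service.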

It should still be safe to use the EntrezGene id as displayed on the 
main gene page, as these come via the curated models and are more 
reliable than the predicted models.
Ensembl stable ids, gene names and the results from the RefSeq overlap 
entries are all included in our search indexes.
Other external references, however, are not, which is why you only get 
one result for a given EntrezGene id.

To ensure you are looking at the correct genes, cross-checking with 
other resources (or using a bidirectional query) is a sensible option.
You can also compare the possible EntrezGene ids with the assigned HGNC 
symbol.
The EntrezGene id you want is likely to be the one that agrees with the 
HGNC mapping.
I have attached an example from Biomart.
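
As a rough illustration of that HGNC cross-check, the Python sketch 
below keeps only the candidate EntrezGene id that agrees with the HGNC 
symbol. It assumes you already have an HGNC symbol to EntrezGene id 
table from HGNC itself; the two pairs used here are only the ones 
stated in this thread, and the names HGNC_ENTREZ and pick_entrez are 
hypothetical.

```python
# HGNC symbol -> EntrezGene id. Only the two pairs stated in this thread
# are listed; in practice this table would be loaded from an HGNC
# download (assumption, not shown here).
HGNC_ENTREZ = {
    "CBWD1": "55871",   # from the NCBI record quoted in this thread
    "CBWD3": "445571",  # reported in this thread for ENSG00000196873
}


def pick_entrez(symbol, candidates, hgnc_entrez=HGNC_ENTREZ):
    """Return the candidate id that HGNC assigns to `symbol`, else None."""
    expected = hgnc_entrez.get(symbol)
    if expected is not None and expected in candidates:
        return expected
    return None


# The four EntrezGene ids BioMart returned for ENSG00000196873 (CBWD3).
candidates = ["55871", "150472", "220869", "445571"]
print(pick_entrez("CBWD3", candidates))
```

Returning None when there is no agreement is a deliberate choice: it 
flags the gene for manual review instead of silently picking one of 
the alignment-based ids.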


Regards,
Magali

On 04/09/2015 14:29, Ragavendran, Ashok wrote:
> Hi Magali,
>     Thanks for your prompt response. I understand that there isn't 
> 100% concordance across databases and that there is bound to be 
> some level of incongruence. However, what I am concerned about is the 
> inconsistency within the Ensembl databases themselves. Perhaps I am 
> not doing something right, and I would be grateful for any 
> suggestions on how to change my approach.
>      To clarify:
>          1) Using the Ensembl gene ID as a key when I query BioMart, I 
> get 4 Entrez IDs, as seen below in my original email.
>
>          2) However, if I look the gene up using the search tool on 
> Ensembl, I get *only 1 Entrez ID*, in this case ENSG00000196873=445571
> http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000196873;r=9:68232003-68300015
>          3) If I look up Entrez ID 55871 on NCBI, I get the following 
> (http://www.ncbi.nlm.nih.gov/gene/55871)
>
>         Official Symbol: CBWD1 (provided by HGNC)
>         Official Full Name: COBW domain containing 1 (provided by HGNC)
>         Primary source: HGNC:HGNC:17134
>         See related: Ensembl:ENSG00000172785; HPRD:16686; MIM:611078;
>         Vega:OTTHUMG00000019425
>         Gene type: protein coding
>         RefSeq status: VALIDATED
>         Organism: Homo sapiens
>         Lineage: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
>         Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates;
>         Haplorrhini; Catarrhini; Hominidae; Homo
>         Also known as: COBP
>
>         which corresponds to ENSG00000172785, and if I then look up that
>     Ensembl ID I get ENSG00000172785=55871, *only 1 Entrez ID*
>     http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000172785;r=9:121038-179147
>
>     4) Finally, if I query BioMart using the Entrez IDs, I get the
>     following table:
>
>     Ensembl Gene ID    Associated Gene Name    EntrezGene ID
>     ENSG00000136682    CBWD2                   150472
>     ENSG00000147996    CBWD5                   220869
>     ENSG00000147996    CBWD5                   55871
>     ENSG00000172785    CBWD1                   150472
>     ENSG00000172785    CBWD1                   55871
>     ENSG00000196873    CBWD3                   150472
>     ENSG00000196873    CBWD3                   220869
>     ENSG00000196873    CBWD3                   445571
>     ENSG00000196873    CBWD3                   55871
>     ENSG00000215126    CBWD7                   150472
>     ENSG00000215126    CBWD7                   220869
>
>     Where we can see that ENSG00000196873 is associated with all 4
>     EntrezIDs.
>
>     This brings me to two questions:
>         1) Is the solution to take the union of a bidirectional
>     query? That is, first query with the Ensembl gene IDs, then use
>     the resulting EntrezGene IDs to query for Ensembl gene IDs, and
>     create a union of the results?
>         2) The confusion arose partly from the web search result (see
>     the URL below), where it shows only one Entrez ID associated with
>     the gene, and from the concordance in the NCBI hyperlink. Perhaps it
>     might be possible to *update the webpage to be consistent with the
>     BioMart results*? This is concerning because, when it's a
>     single gene, we usually search using the web interface, and this
>     might lead to erroneous conclusions.
>     http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000196873;r=9:68232003-68300015
>
>     Once again thanks a lot for all your help in this regard.
>
>     Cheers
>         Ashok
>
>
>
> On 9/4/15 4:59 AM, dev-request at ensembl.org wrote:
>> Message: 2
>> Date: Fri, 04 Sep 2015 09:58:23 +0100
>> From: mag <mr6 at ebi.ac.uk>
>> Subject: Re: [ensembl-dev] Bug?? Error Mapping EnsemblID to entrez id
>> To: Dev <dev at ensembl.org>
>> Message-ID: <55E95D2F.3060505 at ebi.ac.uk>
>> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>>
>> Hi Ashok,
>>
>> Mapping between resources is a complicated process which unfortunately
>> exposes some edge cases like this one.
>> To map Ensembl genes to EntrezGene ids, there is no direct mapping
>> available, hence we map via their respective transcripts, Ensembl
>> transcripts and RefSeq mRNAs.
>> Where the data is available, we attempt to map based on genomic
>> coordinates, but when everything else fails, the sequences are aligned.
>> Only the best hit is kept, but we do allow for mismatches as we know
>> models can vary between Ensembl and RefSeq, in particular regarding UTR
>> regions.
>>
>> In this particular example, the Ensembl transcript ENST00000618217
>> aligns very well against 3 separate RefSeq sequences
>> http://e81.ensembl.org/Homo_sapiens/Transcript/Similarity?db=core;g=ENSG00000196873;r=9:68232003-68300015;t=ENST00000618217
>> corresponding to CBWD1, CBWD2 and CBWD3.
>> Another transcript, ENST00000377342, aligns against 2 different RefSeq
>> sequences, corresponding to CBWD3 and CBWD5
>> http://e81.ensembl.org/Homo_sapiens/Transcript/Similarity?db=core;g=ENSG00000196873;r=9:68232003-68300015;t=ENST00000377342
>> As a result, we have not one but 4 different EntrezGene ids for the
>> same Ensembl gene.
>>
>> Note that all these RefSeq sequences are predicted sequences, as noted
>> by the XM_ prefix.
>> This means that we would never use any of those EntrezGene ids to name
>> the gene.
>> However, we still provide the initial mappings as these are our best
>> guess as to which RefSeq transcript corresponds to which Ensembl
>> transcript.
>>
>> We are hoping to improve these mappings by including genomic coordinate
>> information for predicted models, as this is already done for the
>> curated RefSeq (NM_-like identifiers).
>> This is unlikely to be available before the end of the year though.
>>
>> For correct gene naming, we recommend using HGNC identifiers, as these
>> are obtained via curated direct mappings from HGNC, who update them
>> regularly.
>>
>> Hope this helps,
>> Magali
>>
>> On 03/09/2015 20:04, Ragavendran, Ashok wrote:
>>> >Hello,
>>> >     I came upon this while using the BioMart interface. There are
>>> >errors mapping Ensembl ID to EntrezGene ID. The Ensembl ID maps to the
>>> >wrong Entrez ID; when I click the Entrez link it shows a different
>>> >Ensembl ID. Attached is a screenshot of the results. The Ensembl ID
>>> >refers to CBWD3, but the EntrezGene IDs are for CBWD1, CBWD2, CBWD5 and
>>> >CBWD3. The last result is the correct one. All others are wrong and
>>> >they actually have different Ensembl IDs, which is what I wanted to
>>> >retrieve.
>>> >
>>> >     Is there something I am missing??
>>> >
>>> >Cheers
>>> >     Ashok
>>> >====== Text based Results from querying the gene id ENSG00000196873
>>> >=======
>>> >Ensembl Gene ID    EntrezGene ID
>>> >ENSG00000196873    55871
>>> >ENSG00000196873    150472
>>> >ENSG00000196873    220869
>>> >ENSG00000196873    445571
>>> >
>>> >
>>> >===== Screenshot of results: May not come through ===
>>> >
>>> >
>>> >
>>> >-- 
>>> >Ashok Ragavendran
>>> >Bioinformatics Specialist
>>> >Center for Human Genetic Research
>>> >Massachusetts General Hospital
>>> >Richard B. Simches Research Center
>>> >185 Cambridge St, Boston MA 02114
>>> >aragavendran at mgh.harvard.edu
>>> >ph: +1-617-726-1329
>>> >
>>> >_______________________________________________
>>> >Dev mailing list    Dev at ensembl.org
>>> >Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>> >Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2015-09-04 at 14.54.02.png
Type: image/png
Size: 34975 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20150904/fff7e3a2/attachment.png>