[ensembl-dev] [SPAM] - Re: [SPAM] - Re: Annotation discrepancy - Bayesian Filter detected spam - Email found in subject

Oliver, Gavin gavin.oliver at almacgroup.com
Tue Nov 23 17:36:36 GMT 2010


I have found another instance in Ensembl where the HGNC gene symbol
refers to version A but the description refers to version B.

Hopefully this is of help in diagnosing this particular issue.

Ensembl protein_coding Gene: ENSG00000169894 (HGNC Symbol: MUC3A)
[Region in detail]

Description: mucin 3B, cell surface associated [Source:HGNC Symbol;Acc:13384]
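
A check along these lines could flag similar cases. This is only a sketch in Python: it assumes the description carries a `[Source:...;Acc:...]` tag in the format shown above, and the function names are hypothetical, not part of any Ensembl API. The accession expected for the gene symbol has to come from HGNC itself and is passed in by the caller.

```python
import re

# Matches the trailing tag in an Ensembl-style description, e.g.
# "[Source:HGNC Symbol;Acc:13384]" (format assumed from the example above).
SOURCE_TAG = re.compile(r"\[Source:(?P<source>[^;\]]+);Acc:(?P<acc>[^\]]+)\]")


def parse_description(description):
    """Split a description into (free text, source name, accession).

    Source and accession are None when no tag is present.
    """
    match = SOURCE_TAG.search(description)
    if match is None:
        return description.strip(), None, None
    text = description[:match.start()].strip()
    return text, match.group("source"), match.group("acc")


def accession_conflicts(symbol_accession, description):
    """True when the description cites a different accession than the symbol.

    symbol_accession is the accession HGNC assigns to the gene symbol;
    it must be looked up separately (hypothetical input here).
    """
    _, _, acc = parse_description(description)
    return acc is not None and acc != symbol_accession


if __name__ == "__main__":
    desc = "mucin 3B, cell surface associated [Source:HGNC Symbol;Acc:13384]"
    print(parse_description(desc))
```

Comparing accessions rather than the free-text names avoids having to decide whether "mucin 3A" and "mucin 3B" are the same string up to formatting.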

Best,
 
Gavin

-----Original Message-----
From: Ewan Birney [mailto:birney at ebi.ac.uk] 
Sent: 19 November 2010 16:49
To: Oliver, Gavin
Cc: ian Longden; dev
Subject: [SPAM] - Re: [SPAM] - Re: [ensembl-dev] Annotation discrepancy
- Bayesian Filter detected spam - Email found in subject


On 19 Nov 2010, at 16:13, Oliver, Gavin wrote:

> Thanks for the input Ewan.
>
> In case of any misinterpretation let me just restate the problem.
>
> In most instances (7 of the 9 examples) I am successfully annotating to
> an Ensembl gene ID.  The corresponding HGNC ID is retrieved, however I
> am not managing to retrieve the Entrez Gene id OR I am retrieving an
> Entrez Gene ID that differs from the one that represents the gene in the
> HGNC and Entrez databases.
>

So - the two missing cases below are because the genes are pseudogenes,
and so don't trigger the same Xref process; in other words, what you are
seeing is that these genes (with HGNC names and EntrezGene IDs) are
pseudogenes, and pseudogenes don't go through the same ID mapping process
as the coding genes. There is not the same attempt to be comprehensive
for pseudogenes as for protein-coding genes, and our Xref mapping pipeline
is not as attuned to the issues.

I think there is a rather simple fix for at least the first two,
in which we follow the Havana->HGNC->EntrezGene chain to pull in
EntrezGene IDs for pseudogenes with named HGNC symbols. (Ian/Glenn -
presumably doable, but I know we should be appropriately paranoid
about changes to the Xref code).
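
In outline, following the Havana->HGNC->EntrezGene chain amounts to something like the following. This is a minimal sketch in Python over plain dictionaries, not the actual Xref pipeline code (which is not shown in this thread); the function and variable names are made up for illustration. The one worked entry uses the IDs from the ABCC13 example quoted below (ENSG00000243064, HGNC 16022, EntrezGene 150000).

```python
def inherit_entrez_ids(gene_to_hgnc, hgnc_to_entrez, gene_to_entrez):
    """Fill in missing EntrezGene IDs by following gene -> HGNC -> EntrezGene.

    gene_to_hgnc:   Ensembl gene ID -> HGNC ID (e.g. from manual curation)
    hgnc_to_entrez: HGNC ID -> EntrezGene ID
    gene_to_entrez: Ensembl gene ID -> set of EntrezGene IDs already mapped

    Returns a new dict; the inputs are left untouched.
    """
    filled = {gene: set(ids) for gene, ids in gene_to_entrez.items()}
    for gene, hgnc_id in gene_to_hgnc.items():
        if filled.get(gene):
            # Gene already has a direct EntrezGene xref: do not override it.
            continue
        entrez_id = hgnc_to_entrez.get(hgnc_id)
        if entrez_id is not None:
            filled.setdefault(gene, set()).add(entrez_id)
    return filled


if __name__ == "__main__":
    # IDs taken from the ABCC13 pseudogene example in this thread.
    gene_to_hgnc = {"ENSG00000243064": "16022"}
    hgnc_to_entrez = {"16022": "150000"}
    print(inherit_entrez_ids(gene_to_hgnc, hgnc_to_entrez, {}))
```

Only genes with no existing EntrezGene mapping are touched, which keeps the change conservative with respect to the mappings the pipeline already produces.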


I think the CDK11B case needs more investigation (we should file this
on the curator tracking system).



A meta-comment on Xrefs:

    We've put _a lot_ of effort into getting the following chain right:

     Havana (manual curation)

       -> Ensembl Merge (A principled merge between Havana and Ensembl)

       <-> Uniprot

       <-> HGNC

       <-> EntrezGene

This is both for data items (gene structures and protein sequence)
and Xrefs, for which there are a variety of different rules for how
the Xrefs propagate in different scenarios. A key feature is
that we have manual curation in the Havana set, which includes
HGNC Xref assignment; the gene models in Havana+Ensembl are locked
in with the RefSeq models by the CCDS project (see Genome Research
paper); and the RefSeq<=>EntrezGene<=>HGNC links are all kept straight.
Together these give us progressive improvement _and_ a way of manually
fixing issues that occur.

I know that in e61 we hope to have changed the Xref system in Human
(and mouse?) to be effectively locked in more between Ensembl<->Uniprot
(a separate issue to this Ensembl<->EntrezGene one), but long experience
is that we have to be pretty paranoid about the Xref mapping system
as - as you can see - getting it wrong _really_ annoys people (quite
rightly!).


Finally, the emphasis has been placed on getting everything totally
straight on protein coding genes. We have not put so much effort
into pseudogenes.




> Furthermore, in a couple of instances I am searching for a known HGNC ID
> or Entrez ID and it is not contained in the Ensembl database.
>
> Gavin
>
> -----Original Message-----
> From: Ewan Birney [mailto:birney at ebi.ac.uk]
> Sent: 19 November 2010 15:09
> To: ian Longden
> Cc: Oliver, Gavin; dev
> Subject: [SPAM] - Re: [ensembl-dev] Annotation discrepancy - Bayesian
> Filter detected spam
>
>
> Ian -
>
> at least for 150000, the HGNC symbol exists, and maps via a gene with
> Havana
>
>
> http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000243064;r=21:15646120-15735075
>
> But as this is a non-coding gene, it does not get the normal mapping I
> suspect.
>
>
> The HGNC locus says that it is a pseudogene:
>
> http://www.genenames.org/data/hgnc_data.php?hgnc_id=16022
>
>
> As does the EntrezGene case:
>
>
> http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=150000
>
>
>
> So -
>
>   I think at least some of these are pseudogenes with symbols. We
> should probably at the very least inherit the EntrezGene ID from the
> Havana->HGNC->EntrezGene linkage in these scenarios.
>
>
>
> Second one also the same:
>
>
> http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000234608;r=12:112277571-112280706
>
>
> Gavin -
>
>   I suspect this is the main issue here.
>
>
> On 19 Nov 2010, at 14:59, ian Longden wrote:
>
>> 150000, 51275, 55449, 57126,  503646, 8693 are all unmapped in human.
>> These will have entries in the xref table but are not linked to any
>> genes.
>> I am not sure how the search is done but this may affect it.
>>
>> 984:-
>> Not sure what the problem is here: we find the gene of interest, but
>> this gene also has other EntrezGene IDs.
>>
>> 8857:-
>> as above
>>
>> 9026:-
>> as above.
>>
>> Were you expecting only 1 EntrezGene per gene? In time I hope this
>> becomes true but as these are very similar the software cannot choose
>> between them and uses both.
>>
>> I think the data is correct but maybe the search is not giving you
>> exactly what you want.
>> We need to look at having the unmapped cases searchable.
>>
>> Ian.
>>
>> On Fri, Nov 19, 2010 at 11:55 AM, Oliver, Gavin
>> <gavin.oliver at almacgroup.com> wrote:
>>> I have a few more examples of discrepancies which will hopefully
>>> help.
>>>
>>>
>>>
>>> For all examples, the search was performed on Entrez ID but returned
>>> nothing.  I have looked a bit deeper into a handful of examples.
>>> Details below:
>>>
>>>
>>>
>>> Entrez ID 150000: Associated gene symbol ABCC13 in database but with
>>> no associated Entrez ID.
>>>
>>> Entrez ID 51275: Associated gene symbol C12orf47 in database but no
>>> associated Entrez ID.
>>>
>>> Entrez ID 55449: Associated gene symbol C14orf167 in database but no
>>> associated Entrez ID.
>>>
>>> Entrez ID 57126: Associated gene symbol CD177 in database with no
>>> associated Entrez ID.
>>>
>>> Entrez ID 984: Associated gene symbol CDK11B is not in database.
>>> CDK11A is in database but is annotated as cyclin-dependent kinase 11B
>>> with Entrez ID 100294398, which Entrez describes as LOC100294398 (cell
>>> division protein kinase 11B-like).
>>>
>>> Entrez ID 503646: Neither this ID nor associated gene symbol DPRXP5
>>> are in the database.
>>>
>>> Entrez ID 8857: Associated gene symbol FCGBP (Fc fragment of IgG
>>> binding protein) is there but with Entrez Gene ID 100133944, which
>>> corresponds to LOC100133944 IgGFc-binding protein-like.
>>>
>>> Entrez ID 8693: Neither this ID nor associated gene symbol GALNT4 are
>>> in the database.
>>>
>>> Entrez ID 9026: Gene symbol HIP1R (huntingtin interacting protein 1
>>> related) is in the database but with Entrez ID 100294412, which
>>> corresponds to huntingtin-interacting protein 1-related protein-like.
>>>
>>>
>>>
>>> Best,
>>>
>>>
>>>
>>> Gavin
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>>
>>> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On
>>> Behalf Of
>>> Oliver, Gavin
>>> Sent: 19 November 2010 10:29
>>> To: dev at ensembl.org
>>> Subject: [ensembl-dev] Annotation discrepancy
>>>
>>>
>>>
>>> Hi all,
>>>
>>>
>>>
>>> I have been using Ensembl human for internal annotation of microarrays.
>>>
>>> Yesterday someone did a search for Entrez Gene ID 3336 in our database.
>>> It returned no hits.
>>>
>>> When they searched with the Gene symbol for this ID (HSPE1), they got 5
>>> hits but the Entrez ID associated with the gene was 100132346 (and not
>>> 3336 as would be expected).
>>>
>>> I ran a search for 100132346 against the Ensembl genome browser and it
>>> brings back 2 genes on 2 different chromosomes.
>>>
>>>
>>>
>>> Can someone explain what might be happening here?
>>>
>>>
>>>
>>> Best,
>>>
>>>
>>>
>>> Gavin
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list
>>> Dev at ensembl.org
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>
>>>
>>
>
>
>





