[ensembl-dev] Clarification for different stable IDs with same external name

Fri Jul 14 20:16:30 BST 2023

Hi Hiep,

The version change from 6 to 1 was not intentional. As you would expect, 
versions should always increase and older versions must not be reused, 
so this looks like anomalous behaviour in our stable id mapping.

The annotation of this gene changed in an unusual way between releases 
107 and 108. To understand this, two aspects of our internal workflow 
must be considered:

1) The human gene annotation is constantly being edited manually in our 
internal database, from which snapshots or freezes are taken at regular 
intervals for the Ensembl releases.

2) Our internal database generates stable ids that must be honoured by 
the stable id mapping that assigns identifiers and versions in the 
Ensembl release. Versions, however, are assigned by the stable id 
mapping process by comparison with the previous release. It is done this 
way because the human gene annotation can undergo multiple changes 
between releases (hence multiple version increments in our internal 
database) but only single increments are expected between Ensembl releases.
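
As a rough illustration of point 2 (a simplified Python sketch, not the
actual stable id mapping code; the function and parameter names are
hypothetical), several internal edits between two freezes still result
in a single version increment in the release:

```python
from typing import Optional


def release_version(prev_release_version: Optional[int], internal_edits: int) -> int:
    """Version to publish for a stable id in the new release (illustrative only).

    prev_release_version: version in the previous Ensembl release, or None
        if the stable id was not present in that release.
    internal_edits: number of changes made in the internal database since
        the previous freeze (hypothetical input).
    """
    if prev_release_version is None:
        return 1  # stable id is new in this release
    if internal_edits > 0:
        # However many internal edits there were, the release version
        # goes up by exactly one.
        return prev_release_version + 1
    return prev_release_version  # unchanged annotation keeps its version


print(release_version(5, 3))     # three internal edits, one increment
print(release_version(5, 0))     # no change, version is kept
print(release_version(None, 2))  # stable id not in the previous release
```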

The gene ENSG00000250765 was a lncRNA gene until release 107. Then, our 
manual annotation added a pseudogene transcript, so the gene became a 
transcribed unprocessed pseudogene that also included the lncRNA 
transcript. Before the freeze for release 108, the lncRNA was separated 
from the pseudogene (as explained in my previous reply) and, following 
the standard procedure, the pseudogene kept the original gene id 
(ENSG00000250765), perhaps against common sense. When the stable id 
mapping process for release 108 compared this annotation with 107, it 
found that ENSG00000250765 was now a different gene with no transcripts 
in common with 107, so it determined that this was a new gene and gave 
it version 1. However, it was forced to keep the gene id that came from 
our internal database.
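
The behaviour described above can be reproduced with a toy model (a
Python sketch under simplifying assumptions, not our real mapping code:
the transcript ids are placeholders, the function names are
hypothetical, and the rule "no shared transcripts means new gene" is a
simplification of the real comparison). It also shows the kind of check
that would flag a version going backwards:

```python
from typing import Dict, List, Set, Tuple

# gene id -> (version, transcript ids) as published in the previous release
Release = Dict[str, Tuple[int, Set[str]]]


def assign_version(gene_id: str, transcripts: Set[str], previous: Release) -> int:
    """Assign a version by comparing a gene with the previous release."""
    if gene_id not in previous:
        return 1  # gene id not seen before
    prev_version, prev_transcripts = previous[gene_id]
    if not transcripts & prev_transcripts:
        # No transcripts in common: treated as a new gene, so the version
        # restarts at 1 even though the gene id itself is reused.
        return 1
    if transcripts != prev_transcripts:
        return prev_version + 1  # same gene, changed annotation
    return prev_version  # unchanged


def version_regressions(previous: Release, new_versions: Dict[str, int]) -> List[str]:
    """Return gene ids whose new version is lower than the previous one."""
    return sorted(
        gene_id
        for gene_id, version in new_versions.items()
        if gene_id in previous and version < previous[gene_id][0]
    )


# Release 107: lncRNA gene at version 6 (transcript id is a placeholder).
release_107 = {"ENSG00000250765": (6, {"T_lncRNA"})}
# Release 108: same gene id, but only the pseudogene transcript remains.
new_versions = {
    "ENSG00000250765": assign_version("ENSG00000250765", {"T_pseudogene"}, release_107)
}
print(new_versions)                                    # {'ENSG00000250765': 1}
print(version_regressions(release_107, new_versions))  # ['ENSG00000250765']
```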

This looks like a possible bug in our code that wasn't able to handle an 
unexpected situation. We will need to investigate this further.

Thank you for bringing this to our attention.

Jose

On 13/07/2023 03:42, Hiep Dang wrote:
> Hi Jose,
>
> Thank you for your response, it helps me better understand Ensembl's 
> annotation pipeline. I look forward to the release of Ensembl 110.
>
> There is another unrelated question:
> - As per my understanding, the stable ID version is expected to always 
> increase, and the older version will be retired. However, I came 
> across a case involving ENSG00000250765. In release 108, 
> ENSG00000250765.1 was mapped to ENSG00000250765.6 from release 85. Is 
> this intentional, or is there something wrong with it?
>
> Thanks,
> Hiep
>
>
> image.png
>
> On Thu, Jun 29, 2023 at 10:27 PM Jose Gonzalez <jmgonzalez at ebi.ac.uk> 
> wrote:
>
>     Hi Hiep,
>
>     Thank you for bringing this issue to our attention. Please find
>     below my replies to your questions.
>     On 22/06/2023 03:33, Hiep Dang wrote:
>>     Dear Ensembl Team,
>>
>>     I am doing a project that needs to convert stable IDs to gene
>>     symbols. When I referenced the HGNC database, one symbol
>>     corresponds to only one stable ID. However, in the Ensembl
>>     database, one symbol can correspond to many stable IDs. I worry
>>     that using the HGNC reference will drop out some gene
>>     information. To clarify this problem, I investigated why some
>>     stable IDs share their gene names. For the human genes, I found
>>     that they fall into 3 cases:
>>
>>     1. Stable IDs from non-primary assemblies:
>>     - These stable IDs will not be in the released GTF file (which
>>     contains chromosomes 1-22, X, Y, and MT). I can only retrieve
>>     these IDs from BioMart. This confuses me because, for a regular
>>     use case such as transcriptomic alignment and quantification, the
>>     input file is only the GTF file. So when should I consider using
>>     these IDs from other assemblies?
>     Please note that a GTF file with the gene annotation of the
>     alternate regions (patches and haplotypes) is also part of the
>     Ensembl release files:
>     http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz
>
>
>     The use of primary assembly sequences is normally sufficient for
>     transcriptomic alignment and quantification. The inclusion of
>     alternate regions could lead to multi-mapping issues, since they
>     are very similar to the corresponding sequences in the primary
>     assembly. However, some users may be interested in the annotation
>     on alternate regions. For instance, some gene annotations are
>     known to be inaccurate because of underlying errors in the primary
>     assembly, and the corrected annotations can be found in fix
>     patches that have the corrected genomic sequences. Or genetics
>     researchers may be interested in the variation of gene annotations
>     in different haplotypes.
>
>>
>>     - After dropping the stable IDs from non-primary assemblies,
>>     there are still about 1700 IDs that share the external gene name.
>>     Considering only the genes with their sources from HGNC or NCBI,
>>     they will fall into the following 2 cases.
>>
>>
>>     2. Stable IDs with similar chromosomal positions:
>>     - For example: ENSG00000291019 (chr5: 178764861 - 178818435) and
>>     ENSG00000250420 (chr5: 178767204 - 178797611). They are both
>>     assigned to AACSP1 with a source from HGNC. However, the HGNC
>>     database only references ENSG00000250420.
>>
>>     - Why do these two stable IDs exist at the same time? It seems
>>     like they are essentially one gene. In the future version, will
>>     one of them be retired?
>>
>     Most of these cases have their origin in a recent change in the
>     way that we annotate transcribed pseudogenes. The pseudogene
>     model, containing the homology with a coding gene, has been
>     dissociated from the transcriptional evidence, which is now
>     grouped in one or more lncRNA genes. An undesired side effect of
>     this change is that both genes still share the same gene name. The
>     pseudogene keeps the same stable ID, so it gets its name directly
>     from HGNC, whereas the lncRNA gene gets the same name via NCBI
>     based on its genomic overlap with the transcribed pseudogene
>     annotated by RefSeq.
>
>     Please note that this change only involves
>     "transcribed_unprocessed_pseudogene" genes in release 109, but it
>     will be extended to the remaining transcribed pseudogene biotypes
>     ("transcribed_processed_pseudogene" and
>     "transcribed_unitary_pseudogene") in release 110.
>
>     In future releases, transcribed pseudogenes and lncRNAs will still
>     be separate genes with their own IDs. On the other hand, HGNC are
>     reluctant to assign the same gene symbol to more than one Ensembl
>     stable ID. To fix these gene name duplicate issues, we will simply
>     remove the current gene names from the lncRNA genes that overlap
>     transcribed pseudogenes without giving them a new name. However,
>     due to the Ensembl release cycle timing, this will not happen
>     before release 112.
>
>     Within the remaining set of duplicates, a few genes such as
>     SPATA13 and SCARNA4 could certainly be merged to remove the
>     duplication and we will look into fixing the annotation.
>
>
>>     3. Stable IDs with different chromosomal positions:
>>     - For example: ENSG00000240356 (chr2: 113610502 - 113627090 -
>>     HGNC referenced) and ENSG00000291064 (chr22: 50756948 - 50801309
>>     - NCBI referenced: 118433
>>     <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=118433>).
>>     They are both assigned to RPL23AP7. The former is currently in
>>     the HGNC database. When I go to the NCBI website, the current
>>     position is chr2: 113611239-113627138, which is more similar to
>>     ENSG00000240356.
>>
>>     - Will these NCBI-referenced genes be fixed in future releases?
>     These cases seem to have been caused by the inaccurate name
>     assignment to one of the genes because of their high sequence
>     similarity to other genes in the same family. For instance, gene
>     ENSG00000291064 should have been called RPL23AP82 instead of
>     RPL23AP7. The pipeline seems to have taken into account the
>     sequence identity with the NCBI genes of this family but not the
>     genomic overlap with the NCBI gene RPL23AP82. We still need more
>     time to investigate why this happened and will try to fix it in
>     future releases.
>
>     Just a heads-up that there will be another source of duplicate
>     gene names from release 110 onwards, as chromosome Y PAR genes are
>     now annotated separately, i.e. they will have their own stable IDs
>     but keep the same gene names as their chromosome X counterparts.
>
>>
>>     I have attached the duplicated stable IDs for case 2 and case 3
>>     that I retrieved from BioMart release 109.
>>     Thank you and I look forward to your response.
>>
>>     Best,
>>     Hiep
>>
>     Please let me know if you have any further questions about this.
>
>     Thanks,
>     Jose
>
>
>>     _______________________________________________
>>     Dev mailing list Dev at ensembl.org
>>     Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>>     Ensembl Blog: http://www.ensembl.info/
>
>     -- 
>     Dr Jose M. Gonzalez
>     GENCODE Bioinformatician (Genome Interpretation Team)
>     European Bioinformatics Institute (EMBL-EBI)
>     Wellcome Genome Campus
>     Hinxton, CB10 1SD, UK
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 114061 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20230714/2af1d662/attachment-0001.png>