[ensembl-dev] Clarification for different stable IDs with same external name

Mon Jul 17 07:05:36 BST 2023

Dear Jose,

Thank you for your reply. Your clear explanation has totally resolved my
concerns.

Best regards,
Hiep

On Sat, Jul 15, 2023 at 2:18 AM Jose Gonzalez <jmgonzalez at ebi.ac.uk> wrote:

> Hi Hiep,
>
> The version change from 6 to 1 was not intentional. As you would expect,
> versions should always increase and older versions must not be reused, so
> this looks like an anomalous behaviour of our stable id mapping.
>
> The annotation of this gene changed in an unusual way between releases 107
> and 108. To understand this, two aspects of our internal workflow must be
> considered:
>
> 1) The human gene annotation is being manually edited constantly in our
> internal database, from which snapshots or freezes are taken at regular
> intervals for the Ensembl releases.
>
> 2) Our internal database generates stable ids that must be honoured by the
> stable id mapping that assigns identifiers and versions in the Ensembl
> release. Versions, however, are assigned by the stable id mapping process
> by comparison with the previous release. It is done this way because the
> human gene annotation can undergo multiple changes between releases (hence
> multiple version increments in our internal database) but only single
> increments are expected between Ensembl releases.
>
> The gene ENSG00000250765 was a lncRNA gene until release 107. Then, our
> manual annotation added a pseudogene transcript, so the gene became a
> transcribed unprocessed pseudogene that also included the lncRNA
> transcript. Before the freeze for release 108, the lncRNA was separated
> from the pseudogene (as explained in my previous reply) and, to follow the
> standard procedure, the pseudogene kept the original gene id
> (ENSG00000250765), perhaps against common sense. When the stable id mapping
> process for release 108 compared this annotation with 107, it found that
> ENSG00000250765 was now a different gene with no common transcripts in 107,
> so it determined that this was a new gene and gave it version 1. However,
> it was forced to keep the gene id that came from our internal database.
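
The comparison described above can be illustrated with a minimal Python sketch. It is not Ensembl's stable id mapping code: the function name, the transcript labels and the rule that "changed" simply means "a different set of transcripts" are assumptions made for illustration only.

```python
from typing import Optional, Set


def assign_version(
    prev_transcripts: Optional[Set[str]],
    prev_version: Optional[int],
    curr_transcripts: Set[str],
) -> int:
    """Toy model of version assignment by comparison with the previous release.

    The gene id itself is taken as given (it comes from the internal
    database); only the version is decided here.
    """
    # Gene id not present in the previous release: start at version 1.
    if prev_transcripts is None or prev_version is None:
        return 1
    # Same gene id but no transcripts in common: treated as a new gene,
    # so the version restarts at 1 even though the id is reused.
    if not (prev_transcripts & curr_transcripts):
        return 1
    # Unchanged annotation keeps its version.
    if prev_transcripts == curr_transcripts:
        return prev_version
    # Any change between releases gives a single increment.
    return prev_version + 1


# Hypothetical transcript labels: the lncRNA transcript in release 107,
# only the pseudogene transcript under the same gene id in release 108.
print(assign_version({"T_lncRNA"}, 6, {"T_pseudogene"}))
```

With these assumed inputs the sketch returns 1, which mirrors the drop from version 6 to version 1 discussed in this thread. A rule that also looked at the reused gene id would presumably return 7 instead.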
>
> This looks like a possible bug in our code that wasn't able to handle an
> unexpected situation. We will need to investigate this further.
>
> Thank you for bringing this to our attention.
>
> Jose
>
>
> On 13/07/2023 03:42, Hiep Dang wrote:
>
> Hi Jose,
>
> Thank you for your response; it helps me better understand Ensembl's
> annotation pipeline. I look forward to the release of Ensembl 110.
>
> There is another unrelated question:
> - As per my understanding, the stable ID version is expected to always
> increase, and the older version will be retired. However, I came across a
> case involving ENSG00000250765. In release 108, ENSG00000250765.1 was
> mapped to ENSG00000250765.6 from release 85. Is this intentional, or is
> there something wrong with it?
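
A check of this kind can be sketched in a few lines of Python. This is only an illustration: the function name and the (release, version) input format are assumptions, and the history would have to be collected separately, for example from the ID history of each release.

```python
from typing import List, Tuple


def version_regressions(
    history: List[Tuple[int, int]],
) -> List[Tuple[int, int, int, int]]:
    """Return (earlier_release, earlier_version, later_release, later_version)
    for every pair of consecutive records where the version decreased."""
    ordered = sorted(history)  # sort by release number
    problems = []
    for (prev_rel, prev_ver), (rel, ver) in zip(ordered, ordered[1:]):
        if ver < prev_ver:
            problems.append((prev_rel, prev_ver, rel, ver))
    return problems


# The two data points mentioned in this message for ENSG00000250765.
history = [(85, 6), (108, 1)]
print(version_regressions(history))
```

For the two records above, the sketch reports the step from release 85 (version 6) to release 108 (version 1) as a regression.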
>
> Thanks,
> Hiep
>
>
> [image: image.png]
>
> On Thu, Jun 29, 2023 at 10:27 PM Jose Gonzalez <jmgonzalez at ebi.ac.uk>
> wrote:
>
>> Hi Hiep,
>> Thank you for bringing this issue to our attention. Please find below my
>> replies to your questions.
>> On 22/06/2023 03:33, Hiep Dang wrote:
>>
>> Dear Ensembl Team,
>>
>> I am doing a project that needs to convert stable IDs to gene symbols.
>> When I referenced the HGNC database, one symbol corresponds to only one
>> stable ID. However, in the Ensembl database, one symbol can correspond to
>> many stable IDs. I worry that using the HGNC reference will drop
>> some gene information. To clarify this problem, I investigated why some
>> stable IDs share their gene names. For the human genes, I found that they
>> fall into 3 cases:
>>
>> 1. Stable IDs from non-primary assemblies:
>> - These stable IDs will not be in the released GTF file (which contains
>> chromosomes 1-22, X, Y, and MT). I can only retrieve these IDs from
>> BioMart. This confuses me because, for a regular use case such as
>> transcriptomic alignment and quantification, the input file is only the GTF
>> file. So when should I consider using these IDs from other assemblies?
>>
>> Please note that a GTF file with the gene annotation of the alternate
>> regions (patches and haplotypes) is also part of the Ensembl release files:
>>
>> http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz
>>
>> The use of primary assembly sequences is normally sufficient for
>> transcriptomic alignment and quantification. The inclusion of alternate
>> regions could lead to multi-mapping issues, since they are very similar to
>> the corresponding sequences in the primary assembly. However, some users
>> may be interested in the annotation on alternate regions. For instance,
>> some gene annotations are known to be inaccurate because of underlying
>> errors in the primary assembly, and the corrected annotations can be found
>> in fix patches that have the corrected genomic sequences. Or genetics
>> researchers may be interested in the variation of gene annotations in
>> different haplotypes.
>>
>>
>> - After dropping the stable IDs from non-primary assemblies, there are
>> still about 1700 IDs that share the external gene name. Considering only
>> the genes with their sources from HGNC or NCBI, they will fall into the
>> following 2 cases.
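
The filtering step above can be sketched in Python as follows. This is a minimal illustration, not the exact procedure used for the attached lists: the row layout (gene ID, gene name, chromosome), the function name and the placeholder patch row are assumptions, and "primary" is taken to mean chromosomes 1-22, X, Y and MT as stated above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: only these sequence names count as the primary assembly.
PRIMARY_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def shared_names(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Map each gene name to its stable IDs when more than one ID uses it.

    Each row is (stable_id, gene_name, chromosome).
    """
    ids_by_name = defaultdict(set)
    for gene_id, gene_name, chromosome in rows:
        if chromosome not in PRIMARY_CHROMOSOMES:
            continue  # drop IDs on patches, haplotypes and other regions
        if not gene_name:
            continue  # genes without a name cannot share one
        ids_by_name[gene_name].add(gene_id)
    return {
        name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1
    }


rows = [
    # IDs and names taken from the examples in this thread.
    ("ENSG00000291019", "AACSP1", "5"),
    ("ENSG00000250420", "AACSP1", "5"),
    ("ENSG00000240356", "RPL23AP7", "2"),
    ("ENSG00000291064", "RPL23AP7", "22"),
    # Hypothetical rows: the second copy sits on a made-up patch region.
    ("ENSG_HYPOTHETICAL_1", "GENE_A", "1"),
    ("ENSG_HYPOTHETICAL_2", "GENE_A", "CHR_HYPOTHETICAL_PATCH"),
]
print(shared_names(rows))
```

In this toy input, GENE_A is no longer reported once the patch copy is dropped, while AACSP1 and RPL23AP7 are still shared by two stable IDs each.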
>>
>>
>> 2. Stable IDs with similar chromosomal positions:
>> - For example: ENSG00000291019 (chr5: 178764861 - 178818435) and
>> ENSG00000250420 (chr5: 178767204 - 178797611). They are both assigned to
>> AACSP1 with a source from HGNC. However, the HGNC database only references
>> ENSG00000250420.
>>
>> - Why do these two stable IDs exist at the same time? It seems like they
>> are essentially one gene. Will one of them be retired in a future
>> release?
>>
>> Most of these cases have their origin in a recent change in the way that
>> we annotate transcribed pseudogenes. The pseudogene model, containing the
>> homology with a coding gene, has been dissociated from the transcriptional
>> evidence, which is now grouped in one or more lncRNA genes. An undesired
>> side effect of this change is that both genes still share the same gene
>> name. The pseudogene keeps the same stable ID, so it gets its name directly
>> from HGNC, whereas the lncRNA gene gets the same name via NCBI based on its
>> genomic overlap with the transcribed pseudogene annotated by RefSeq.
>>
>> Please note that this change only involves
>> "transcribed_unprocessed_pseudogene" genes in release 109, but it will be
>> extended to the remaining transcribed pseudogene biotypes
>> ("transcribed_processed_pseudogene" and "transcribed_unitary_pseudogene")
>> in release 110.
>>
>> In future releases, transcribed pseudogenes and lncRNAs will still be
>> separate genes with their own IDs. On the other hand, HGNC are reluctant to
>> assign the same gene symbol to more than one Ensembl stable ID. To fix
>> these gene name duplicate issues, we will simply remove the current gene
>> names from the lncRNA genes that overlap transcribed pseudogenes without
>> giving them a new name. However, due to the Ensembl release cycle timing,
>> this will not happen before release 112.
>>
>> Within the remaining set of duplicates, a few genes such as SPATA13 and
>> SCARNA4 could certainly be merged to remove the duplication and we will
>> look into fixing the annotation.
>>
>>
>> 3. Stable IDs with different chromosomal positions:
>> - For example: ENSG00000240356 (chr2: 113610502 - 113627090 - HGNC
>> referenced) and ENSG00000291064 (chr22: 50756948 - 50801309 - NCBI
>> referenced:118433
>> <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=118433>).
>> They are both assigned to RPL23AP7. The former is currently in the HGNC
>> database. When I go to the NCBI website, the current position is chr2: 113611239-113627138,
>> which is more similar to ENSG00000240356.
>>
>> - Will these NCBI-referenced genes be fixed in future releases?
>>
>> These cases seem to have been caused by the inaccurate name assignment to
>> one of the genes because of their high sequence similarity to other genes
>> in the same family. For instance, gene ENSG00000291064 should have been
>> called RPL23AP82 instead of RPL23AP7. The pipeline seems to have taken into
>> account the sequence identity with the NCBI genes of this family but not
>> the genomic overlap with the NCBI gene RPL23AP82. We still need more time
>> to investigate why this happened and will try to fix it in future releases.
>>
>> Just a heads-up that there will be another source of duplicate gene names
>> from release 110 onwards, as chromosome Y PAR genes are now annotated
>> separately, i.e. they will have their own stable IDs but keep the same gene
>> names as their chromosome X counterparts.
>>
>>
>> I have attached the duplicated stable IDs for case 2 and case 3 that I
>> retrieved from BioMart release 109.
>> Thank you and I look forward to your response.
>>
>> Best,
>> Hiep
>>
>> Please let me know if you have any further questions about this.
>> Thanks,
>> Jose
>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>> Ensembl Blog: http://www.ensembl.info/
>>
>> --
>> Dr Jose M. Gonzalez
>> GENCODE Bioinformatician (Genome Interpretation Team)
>> European Bioinformatics Institute (EMBL-EBI)
>> Wellcome Genome Campus
>> Hinxton, CB10 1SD, UK
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 114061 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20230717/c8279b87/attachment-0001.png>