[ensembl-dev] Clarification for different stable IDs with same external name

Thu Jun 29 15:51:06 BST 2023

Hi Hiep,

Thank you for bringing this issue to our attention. Please find below my 
replies to your questions.

On 22/06/2023 03:33, Hiep Dang wrote:
> Dear Ensembl Team,
>
> I am doing a project that needs to convert stable IDs to gene symbols. 
> When I referenced the HGNC database, one symbol corresponds to only 
> one stable ID. However, in the Ensembl database, one symbol can 
> correspond to many stable IDs. I worry that using the HGNC reference 
> will drop out some gene information. To clarify this problem, I 
> investigate why some stable IDs share their gene names. For the human 
> genes, I found that they will belong to 3 cases:
>
> 1. Stable IDs from non-primary assemblies:
> - These stable IDs will not be in the released GTF file (which 
> contains chromosomes 1-22, X, Y, and MT). I can only retrieve these 
> IDs from BioMart. This confuses me because, for a regular use case 
> such as transcriptomic alignment and quantification, the input file is 
> only the GTF file. So when should I consider using these IDs from 
> other assemblies?
Please note that a GTF file with the gene annotation of the alternate 
regions (patches and haplotypes) is also part of the Ensembl release files:
http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz 

The use of primary assembly sequences is normally sufficient for 
transcriptomic alignment and quantification. The inclusion of alternate 
regions could lead to multi-mapping issues, since they are very similar 
to the corresponding sequences in the primary assembly. However, some 
users may be interested in the annotation on alternate regions. For 
instance, some gene annotations are known to be inaccurate because of 
underlying errors in the primary assembly, and the corrected annotations 
can be found in fix patches that have the corrected genomic sequences. 
Or genetics researchers may be interested in the variation of gene 
annotations in different haplotypes.
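
If you want to check which stable IDs come from the primary assembly and which come from alternate regions, a minimal sketch along these lines may help. It assumes the standard nine-column GTF layout and that primary sequences are named 1-22, X, Y and MT, as you describe above; the function name `split_gene_ids` and the file name in the trailing comment are illustrative only.

```python
import re

# Sequence names treated as the primary assembly (assumption: Ensembl-style
# names without a "chr" prefix, as in the released GTF described above).
PRIMARY_SEQS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def split_gene_ids(gtf_lines):
    """Return (primary_ids, alternate_ids) for the 'gene' features in a GTF."""
    primary, alternate = set(), set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene-level features carry one row per stable ID
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        target = primary if fields[0] in PRIMARY_SEQS else alternate
        target.add(match.group(1))
    return primary, alternate


# Possible usage (hypothetical local file name):
# import gzip
# with gzip.open("annotation.gtf.gz", "rt") as handle:
#     primary_ids, alternate_ids = split_gene_ids(handle)
```

Anything placed in the second set would be a gene annotated on a patch, haplotype or scaffold rather than on the primary chromosomes.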

>
> - After dropping the stable IDs from non-primary assemblies, there are 
> still about 1700 IDs that share the external gene name. Considering 
> only the genes with their sources from HGNC or NCBI, they will fall 
> into the following 2 cases.
>
>
> 2. Stable IDs with similar chromosomal positions:
> - For example: ENSG00000291019 (chr5: 178764861 - 178818435) and 
> ENSG00000250420 (chr5: 178767204 - 178797611). They are both assigned 
> to AACSP1 with a source from HGNC. However, the HGNC database only 
> references ENSG00000250420.
>
> - Why do these two stable IDs exist at the same time? It seems like 
> they are essentially one gene. In the future version, will one of them 
> be retired?
>
Most of these cases have their origin in a recent change in the way that 
we annotate transcribed pseudogenes. The pseudogene model, containing 
the homology with a coding gene, has been dissociated from the 
transcriptional evidence, which is now grouped in one or more lncRNA 
genes. An undesired side effect of this change is that both genes still 
share the same gene name. The pseudogene keeps the same stable ID, so it 
gets its name directly from HGNC, whereas the lncRNA gene gets the same 
name via NCBI based on its genomic overlap with the transcribed 
pseudogene annotated by RefSeq.

Please note that this change only involves 
"transcribed_unprocessed_pseudogene" genes in release 109, but it will 
be extended to the remaining transcribed pseudogene biotypes 
("transcribed_processed_pseudogene" and 
"transcribed_unitary_pseudogene") in release 110.

In future releases, transcribed pseudogenes and lncRNAs will still be 
separate genes with their own IDs. On the other hand, HGNC are reluctant 
to assign the same gene symbol to more than one Ensembl stable ID. To 
fix these duplicate gene names, we will simply remove the current gene 
names from the lncRNA genes that overlap transcribed pseudogenes, 
without giving them a new name. However, due to the Ensembl release 
cycle timing, this will not happen before release 112.

Within the remaining set of duplicates, a few genes such as SPATA13 and 
SCARNA4 could certainly be merged to remove the duplication and we will 
look into fixing the annotation.
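
Until the names are removed on our side, something similar can be approximated locally. The sketch below is only an illustration, not our pipeline: it uses a shared gene name as a stand-in for genomic overlap, and the record layout (dicts with `gene_id`, `gene_name` and `biotype` keys) and the function name are assumptions.

```python
from collections import defaultdict

# Transcribed pseudogene biotypes named in this thread.
TRANSCRIBED_PSEUDOGENE_BIOTYPES = {
    "transcribed_unprocessed_pseudogene",
    "transcribed_processed_pseudogene",
    "transcribed_unitary_pseudogene",
}


def drop_lncrna_duplicate_names(genes):
    """Map gene_id -> name, blanking lncRNA names shared with a transcribed pseudogene.

    `genes` is an iterable of dicts with keys gene_id, gene_name, biotype
    (a hypothetical layout, e.g. rows exported from BioMart).
    """
    genes = list(genes)
    by_name = defaultdict(list)
    for gene in genes:
        if gene["gene_name"]:
            by_name[gene["gene_name"]].append(gene)

    resolved = {gene["gene_id"]: gene["gene_name"] or None for gene in genes}
    for group in by_name.values():
        has_pseudogene = any(
            gene["biotype"] in TRANSCRIBED_PSEUDOGENE_BIOTYPES for gene in group
        )
        if len(group) > 1 and has_pseudogene:
            for gene in group:
                if gene["biotype"] == "lncRNA":
                    resolved[gene["gene_id"]] = None  # name removed, no new name
    return resolved
```

Duplicates that do not involve a transcribed pseudogene are left untouched, since they need to be looked at case by case.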
>
> 3. Stable IDs with different chromosomal positions:
> - For example: ENSG00000240356 (chr2: 113610502 - 113627090 - HGNC 
> referenced) and ENSG00000291064 (chr22: 50756948 - 50801309 - NCBI 
> referenced:118433 
> <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=118433>). 
> They are both assigned to RPL23AP7. The former is currently in the 
> HGNC database. When I go to the NCBI website, the current position is 
> chr2: 113611239-113627138, which is more similar to ENSG00000240356.
>
> - Will these NCBI-referenced genes be fixed in future releases?

These cases seem to have been caused by an inaccurate name assignment 
to one of the genes, owing to its high sequence similarity to other 
genes in the same family. For instance, gene ENSG00000291064 should have 
been called RPL23AP82 instead of RPL23AP7. The pipeline seems to have 
taken into account the sequence identity with the NCBI genes of this 
family but not the genomic overlap with the NCBI gene RPL23AP82. We 
still need more time to investigate why this happened and will try to 
fix it in future releases.
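
To tell these cases apart from the overlapping ones in a list of duplicates, a simple coordinate check may be enough. This is a rough sketch using the coordinates quoted in your message; the `Locus` type and function names are made up for the example.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    gene_id: str
    chrom: str
    start: int
    end: int


def overlaps(a: Locus, b: Locus) -> bool:
    """True if both loci are on the same sequence and their ranges intersect."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify_pair(a: Locus, b: Locus) -> str:
    """Label a pair of same-named genes as 'overlapping' or 'distinct_loci'."""
    return "overlapping" if overlaps(a, b) else "distinct_loci"


# Coordinates as quoted in this thread.
aacsp1 = (
    Locus("ENSG00000291019", "5", 178764861, 178818435),
    Locus("ENSG00000250420", "5", 178767204, 178797611),
)
rpl23ap7 = (
    Locus("ENSG00000240356", "2", 113610502, 113627090),
    Locus("ENSG00000291064", "22", 50756948, 50801309),
)

print(classify_pair(*aacsp1))    # overlapping
print(classify_pair(*rpl23ap7))  # distinct_loci
```

Pairs labelled `distinct_loci` are the ones where a wrong name assignment is the more likely explanation.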

Just a heads-up that there will be another source of duplicate gene 
names from release 110 onwards, as chromosome Y PAR genes are now 
annotated separately, i.e. they will have their own stable IDs but keep 
the same gene names as their chromosome X counterparts.
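
Since some duplicate names will remain for the time being, one option for your ID-to-symbol conversion is to keep the stable ID as the key and only disambiguate the label when a symbol is shared. The following is a sketch under that assumption; the function name and the `NAME_ID` label format are arbitrary choices, not an Ensembl convention.

```python
from collections import Counter


def unique_labels(id_to_name):
    """Return gene_id -> display label, unique even when symbols are shared.

    Genes without a name fall back to their stable ID; genes whose name is
    shared by several stable IDs get the ID appended to the name.
    """
    counts = Counter(name for name in id_to_name.values() if name)
    labels = {}
    for gene_id, name in id_to_name.items():
        if not name:
            labels[gene_id] = gene_id
        elif counts[name] > 1:
            labels[gene_id] = f"{name}_{gene_id}"
        else:
            labels[gene_id] = name
    return labels
```

This keeps every stable ID in the output, so no gene information is dropped when two IDs carry the same symbol.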

>
> I have attached the duplicated stable IDs for case 2 and case 3 that I 
> retrieved from BioMart release 109.
> Thank you and I look forward to your response.
>
> Best,
> Hiep
>
Please let me know if you have any further questions about this.

Thanks,
Jose

> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/

-- 
Dr Jose M. Gonzalez
GENCODE Bioinformatician (Genome Interpretation Team)
European Bioinformatics Institute (EMBL-EBI)
Wellcome Genome Campus
Hinxton, CB10 1SD, UK