[ensembl-dev] Clarification for different stable IDs with same external name

Hiep Dang hiep at bioturing.com
Thu Jun 22 03:33:27 BST 2023


Dear Ensembl Team,

I am working on a project that needs to convert Ensembl stable IDs to gene
symbols. In the HGNC database, each symbol corresponds to exactly one
stable ID. In the Ensembl database, however, one symbol can correspond to
many stable IDs. I worry that relying on the HGNC reference alone will drop
some gene information. To understand the problem, I investigated why some
stable IDs share a gene name. For human genes, I found that they fall into
3 cases:

1. Stable IDs from non-primary assemblies:
- These stable IDs are not in the released GTF file (which contains
chromosomes 1-22, X, Y, and MT). I can only retrieve them from BioMart.
This confuses me because, for a regular use case such as transcriptomic
alignment and quantification, the GTF file is the only annotation input.
So when should I consider using these IDs from other assemblies?

- After dropping the stable IDs from non-primary assemblies, there are
still about 1700 IDs that share an external gene name with another ID.
Considering only the genes whose names are sourced from HGNC or NCBI, they
fall into the following 2 cases.


2. Stable IDs with similar chromosomal positions:
- For example: ENSG00000291019 (chr5: 178764861 - 178818435) and
ENSG00000250420 (chr5: 178767204 - 178797611). They are both assigned to
AACSP1 with a source from HGNC. However, the HGNC database only references
ENSG00000250420.

- Why do these two stable IDs exist at the same time? They appear to be
essentially one gene. Will one of them be retired in a future release?


3. Stable IDs with different chromosomal positions:
- For example: ENSG00000240356 (chr2: 113610502 - 113627090, referenced by
HGNC) and ENSG00000291064 (chr22: 50756948 - 50801309, referenced by NCBI
Gene ID 118433
<http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=118433>).
They are both assigned to RPL23AP7. The former is currently in the HGNC
database. On the NCBI website, the current position of this gene is chr2:
113611239-113627138, which is much closer to ENSG00000240356.

- Will these NCBI-referenced genes be fixed in future releases?
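
For reference, the way I separated the cases above can be sketched roughly
as follows. This is only a minimal illustration, not my exact script: the
dictionary keys (gene_id, name, chrom, start, end) and the function name
are hypothetical and would need to be mapped to the real BioMart export
columns. The last example row is entirely made up to show the
primary-chromosome filter; the other four rows are the examples given
above.

```python
from collections import defaultdict
from itertools import combinations

# Primary chromosomes as named in the Ensembl GTF (no "chr" prefix).
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

# Hypothetical field names; real BioMart column headers differ.
EXAMPLE_ROWS = [
    {"gene_id": "ENSG00000291019", "name": "AACSP1", "chrom": "5",
     "start": 178764861, "end": 178818435},
    {"gene_id": "ENSG00000250420", "name": "AACSP1", "chrom": "5",
     "start": 178767204, "end": 178797611},
    {"gene_id": "ENSG00000240356", "name": "RPL23AP7", "chrom": "2",
     "start": 113610502, "end": 113627090},
    {"gene_id": "ENSG00000291064", "name": "RPL23AP7", "chrom": "22",
     "start": 50756948, "end": 50801309},
    # Made-up record on a made-up non-primary scaffold (case 1).
    {"gene_id": "HYPOTHETICAL_ALT_GENE", "name": "AACSP1",
     "chrom": "HYPOTHETICAL_ALT_SCAFFOLD", "start": 1, "end": 1000},
]


def classify_shared_names(rows):
    """Group primary-chromosome genes by name and label each pair of IDs.

    Returns {name: [(id_a, id_b, label), ...]} for names shared by at
    least two stable IDs. Labels: "overlapping" (case 2),
    "different_chromosome" (case 3), or "same_chromosome_no_overlap".
    """
    by_name = defaultdict(list)
    for row in rows:
        # Case 1: skip anything outside chromosomes 1-22, X, Y, MT.
        if row["name"] and row["chrom"] in PRIMARY:
            by_name[row["name"]].append(row)

    shared = {}
    for name, genes in by_name.items():
        if len(genes) < 2:
            continue  # name is unique on the primary chromosomes
        pairs = []
        for a, b in combinations(genes, 2):
            if a["chrom"] != b["chrom"]:
                label = "different_chromosome"
            elif a["start"] <= b["end"] and b["start"] <= a["end"]:
                # Closed intervals on the same chromosome intersect.
                label = "overlapping"
            else:
                label = "same_chromosome_no_overlap"
            pairs.append((a["gene_id"], b["gene_id"], label))
        shared[name] = pairs
    return shared
```

Calling classify_shared_names(EXAMPLE_ROWS) labels the AACSP1 pair as
"overlapping" and the RPL23AP7 pair as "different_chromosome", and ignores
the made-up scaffold record. Strand is not considered in this sketch.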

I have attached the duplicated stable IDs for cases 2 and 3 that I
retrieved from BioMart release 109.
Thank you, and I look forward to your response.

Best,
Hiep
-------------- next part --------------
A non-text attachment was scrubbed...
Name: duplicated_ensg.csv
Type: text/csv
Size: 136425 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20230622/213bb210/attachment-0001.csv>

