[ensembl-dev] Clarification for different stable IDs with same external name
Hiep Dang
hiep at bioturing.com
Thu Jul 13 03:42:21 BST 2023
Hi Jose,
Thank you for your response; it helps me better understand Ensembl's
annotation pipeline. I look forward to the release of Ensembl 110.
I have another, unrelated question:
- As I understand it, the version of a stable ID is expected to always
increase, and older versions are retired. However, I came across a
case involving ENSG00000250765. In release 108, ENSG00000250765.1 was
mapped to ENSG00000250765.6 from release 85. Is this intentional, or is
something wrong?
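To make the question concrete, here is a minimal Python sketch of how a versioned stable ID splits into a base ID and a version number, and how the two versions mentioned above compare. The helper name `split_stable_id` is hypothetical and is not part of any Ensembl API.

```python
def split_stable_id(versioned_id: str) -> tuple[str, int]:
    """Split 'ENSG00000250765.6' into ('ENSG00000250765', 6)."""
    base, sep, version = versioned_id.partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"no numeric version suffix in {versioned_id!r}")
    return base, int(version)


# The two versioned IDs from the question above.
old_base, old_version = split_stable_id("ENSG00000250765.6")  # seen in release 85
new_base, new_version = split_stable_id("ENSG00000250765.1")  # seen in release 108

same_gene = old_base == new_base
# True when the later release carries a lower version than the earlier one.
version_went_down = same_gene and new_version < old_version
print(same_gene, version_went_down)
```

A check like this only flags the decrease; it says nothing about whether the decrease is intentional.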
Thanks,
Hiep
[image: image.png]
On Thu, Jun 29, 2023 at 10:27 PM Jose Gonzalez <jmgonzalez at ebi.ac.uk> wrote:
> Hi Hiep,
> Thank you for bringing this issue to our attention. Please find below my
> replies to your questions.
>
> On 22/06/2023 03:33, Hiep Dang wrote:
>
> Dear Ensembl Team,
>
> I am working on a project that needs to convert stable IDs to gene symbols.
> In the HGNC database, one symbol corresponds to only one
> stable ID. However, in the Ensembl database, one symbol can correspond to
> many stable IDs. I worry that using the HGNC reference will drop
> some gene information. To clarify this, I investigated why some
> stable IDs share a gene name. For human genes, I found that they
> fall into three cases:
>
> 1. Stable IDs from non-primary assemblies:
> - These stable IDs will not be in the released GTF file (which contains
> chromosomes 1-22, X, Y, and MT). I can only retrieve these IDs from
> BioMart. This confuses me because, for a regular use case such as
> transcriptomic alignment and quantification, the input file is only the GTF
> file. So when should I consider using these IDs from other assemblies?
>
> Please note that a GTF file with the gene annotation of the alternate
> regions (patches and haplotypes) is also part of the Ensembl release files:
>
> http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf.gz
>
> The use of primary assembly sequences is normally sufficient for
> transcriptomic alignment and quantification. The inclusion of alternate
> regions could lead to multi-mapping issues, since they are very similar to
> the corresponding sequences in the primary assembly. However, some users
> may be interested in the annotation on alternate regions. For instance,
> some gene annotations are known to be inaccurate because of underlying
> errors in the primary assembly, and the corrected annotations can be found
> in fix patches that have the corrected genomic sequences. Or genetics
> researchers may be interested in the variation of gene annotations in
> different haplotypes.
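As an illustration of restricting an annotation to the primary assembly, the following is a minimal Python sketch that keeps only GTF records whose first column is one of chromosomes 1-22, X, Y or MT. It assumes sequence names are written without a "chr" prefix; the example records, including the name `CHR_EXAMPLE_PATCH`, are made up for the demonstration and are not taken from a real file.

```python
# Primary chromosomes as listed above: 1-22, X, Y and MT.
# Assumption: sequence names carry no "chr" prefix.
PRIMARY_SEQ_NAMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def is_primary(gtf_line: str) -> bool:
    """Return True if a GTF record sits on a primary chromosome."""
    if not gtf_line.strip() or gtf_line.startswith("#"):
        return False  # blank lines and header/comment lines are not records
    seqname = gtf_line.split("\t", 1)[0]
    return seqname in PRIMARY_SEQ_NAMES


# Made-up records for illustration only.
example_lines = [
    "#!example header line",
    '5\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A";',
    'CHR_EXAMPLE_PATCH\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_B";',
    'MT\tsource\tgene\t10\t20\t.\t+\t.\tgene_id "GENE_C";',
]

primary_records = [line for line in example_lines if is_primary(line)]
print(len(primary_records))
```

Filtering on the sequence name rather than on gene IDs means the same function can be applied to the combined patch and haplotype GTF if that file is ever needed.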
>
>
> - After dropping the stable IDs from non-primary assemblies, there are
> still about 1700 IDs that share an external gene name with another ID.
> Considering only the genes whose names come from HGNC or NCBI, they fall
> into the following two cases.
>
>
> 2. Stable IDs with similar chromosomal positions:
> - For example: ENSG00000291019 (chr5: 178764861 - 178818435) and
> ENSG00000250420 (chr5: 178767204 - 178797611). They are both assigned to
> AACSP1 with a source from HGNC. However, the HGNC database only references
> ENSG00000250420.
>
> - Why do these two stable IDs exist at the same time? They seem to be
> essentially one gene. Will one of them be retired in a future
> release?
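The positional relationship between the two AACSP1 IDs can be checked with a small interval calculation. This is a minimal Python sketch using the coordinates quoted above and assuming they are 1-based and inclusive; the `Interval` type and `overlap_length` helper are hypothetical names for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed inclusive


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


# Coordinates as quoted in this message.
ensg00000291019 = Interval("5", 178764861, 178818435)
ensg00000250420 = Interval("5", 178767204, 178797611)

shared = overlap_length(ensg00000291019, ensg00000250420)
span_250420 = ensg00000250420.end - ensg00000250420.start + 1

# If the shared length equals the shorter gene's span, that gene lies
# entirely within the other one.
print(shared, shared == span_250420)
```

With these coordinates, ENSG00000250420 lies entirely inside the span of ENSG00000291019.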
>
> Most of these cases have their origin in a recent change in the way that
> we annotate transcribed pseudogenes. The pseudogene model, containing the
> homology with a coding gene, has been dissociated from the transcriptional
> evidence, which is now grouped in one or more lncRNA genes. An undesired
> side effect of this change is that both genes still share the same gene
> name. The pseudogene keeps the same stable ID, so it gets its name directly
> from HGNC, whereas the lncRNA gene gets the same name via NCBI based on its
> genomic overlap with the transcribed pseudogene annotated by RefSeq.
>
> Please note that this change only involves
> "transcribed_unprocessed_pseudogene" genes in release 109, but it will be
> extended to the remaining transcribed pseudogene biotypes
> ("transcribed_processed_pseudogene" and "transcribed_unitary_pseudogene")
> in release 110.
>
> In future releases, transcribed pseudogenes and lncRNAs will still be
> separate genes with their own IDs. On the other hand, HGNC are reluctant to
> assign the same gene symbol to more than one Ensembl stable ID. To fix
> these gene name duplicate issues, we will simply remove the current gene
> names from the lncRNA genes that overlap transcribed pseudogenes without
> giving them a new name. However, due to the Ensembl release cycle timing,
> this will not happen before release 112.
>
> Within the remaining set of duplicates, a few genes such as SPATA13 and
> SCARNA4 could certainly be merged to remove the duplication and we will
> look into fixing the annotation.
>
>
> 3. Stable IDs with different chromosomal positions:
> - For example: ENSG00000240356 (chr2: 113610502 - 113627090, HGNC
> referenced) and ENSG00000291064 (chr22: 50756948 - 50801309, NCBI
> referenced: 118433
> <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=118433>).
> They are both assigned to RPL23AP7. The former is currently in the HGNC
> database. On the NCBI website, the current position is chr2: 113611239-113627138,
> which is closer to that of ENSG00000240356.
>
> - Will these NCBI-referenced genes be fixed in future releases?
>
> These cases seem to have been caused by the inaccurate name assignment to
> one of the genes because of their high sequence similarity to other genes
> in the same family. For instance, gene ENSG00000291064 should have been
> called RPL23AP82 instead of RPL23AP7. The pipeline seems to have taken into
> account the sequence identity with the NCBI genes of this family but not
> the genomic overlap with the NCBI gene RPL23AP82. We still need more time
> to investigate why this happened and will try to fix it in future releases.
>
> Just a heads-up that there will be another source of duplicate gene names
> from release 110 onwards, as chromosome Y PAR genes are now annotated
> separately, i.e. they will have their own stable IDs but keep the same gene
> names as their chromosome X counterparts.
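Shared gene names of the kind discussed in this thread can be found in any ID-to-name table with a simple grouping step. Below is a minimal Python sketch that uses only the two example pairs quoted above; the `(stable_id, name)` tuple layout and the helper `ids_by_name` are assumptions for illustration, not the BioMart export format.

```python
from collections import defaultdict

# (stable ID, external gene name) pairs quoted in this thread.
records = [
    ("ENSG00000291019", "AACSP1"),
    ("ENSG00000250420", "AACSP1"),
    ("ENSG00000240356", "RPL23AP7"),
    ("ENSG00000291064", "RPL23AP7"),
]


def ids_by_name(rows):
    """Group stable IDs under the gene name they carry."""
    grouped = defaultdict(list)
    for stable_id, name in rows:
        grouped[name].append(stable_id)
    return grouped


# Keep only the names attached to more than one stable ID.
duplicates = {
    name: ids for name, ids in ids_by_name(records).items() if len(ids) > 1
}

for name, ids in sorted(duplicates.items()):
    print(name, ids)
```

The same grouping would also surface the chromosome X and Y PAR pairs once they carry separate stable IDs, so a downstream conversion table may need an explicit rule for which ID to keep.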
>
>
> I have attached the duplicated stable IDs for case 2 and case 3 that I
> retrieved from BioMart release 109.
> Thank you and I look forward to your response.
>
> Best,
> Hiep
>
> Please let me know if you have any further questions about this.
> Thanks,
> Jose
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/
>
> --
> Dr Jose M. Gonzalez
> GENCODE Bioinformatician (Genome Interpretation Team)
> European Bioinformatics Institute (EMBL-EBI)
> Wellcome Genome Campus
> Hinxton, CB10 1SD, UK
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 114061 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20230713/0b8ad834/attachment-0001.png>