[ensembl-dev] Probably duplicated human gene in latest release

Fergal fergal at ebi.ac.uk
Fri Dec 1 15:52:56 GMT 2017


Hi Wolf,

This is a rather complicated scenario. ENSG00000255292 is a readthrough gene. Readthrough genes are manually annotated by the Havana team and are created when there is some biological evidence of transcription of a single molecule that spans two distinct loci (in this case ENSG00000204370 and ENSG00000197580). This information can be seen on the gene summary via the annotation attribute “overlapping locus”, though admittedly this is not very obvious.

While there is experimental evidence for this occurring, it is unclear if such events have any true biological meaning. In these instances the assigned biotype is often nonsense-mediated decay, to signify that the product is not viable.

As ENSG00000204370 and ENSG00000197580 are two distinct genes and ENSG00000255292 represents a readthrough event between them, the records should not be merged. However, as you’ve noted, this scenario does pose challenges for mapping pipelines when it comes to naming and cross-referencing.

One thing that does appear to have gone wrong is the inclusion of ENST00000614349.4 in the readthrough gene. This was added to the gene by our merge code, which decides whether to merge automatically annotated transcripts into manually curated genes based on exon overlap. ENST00000614349.4 had the most exon overlap with one of the transcripts in the readthrough gene and thus was merged in. We are going to add a rule to avoid merging protein coding transcripts into readthrough genes, which will hopefully solve the issue in future releases.
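
To illustrate the idea only, here is a minimal Python sketch of an overlap-based assignment with that kind of rule added. This is not the actual merge code: the Transcript and Gene classes, the function names and the "protein_coding" biotype check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates


@dataclass
class Transcript:  # hypothetical stand-in, not an Ensembl API class
    stable_id: str
    biotype: str
    exons: List[Exon]


@dataclass
class Gene:  # hypothetical stand-in, not an Ensembl API class
    stable_id: str
    is_readthrough: bool
    transcripts: List[Transcript]


def exon_overlap(a: Transcript, b: Transcript) -> int:
    """Total number of bases shared between the exons of two transcripts."""
    total = 0
    for a_start, a_end in a.exons:
        for b_start, b_end in b.exons:
            total += max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
    return total


def choose_gene(candidate: Transcript, genes: List[Gene]) -> Optional[Gene]:
    """Pick the gene whose transcripts share the most exonic bases with the
    candidate, skipping readthrough genes for protein coding candidates."""
    best, best_score = None, 0
    for gene in genes:
        # Assumed rule: never merge a protein coding transcript into a readthrough gene.
        if gene.is_readthrough and candidate.biotype == "protein_coding":
            continue
        score = max((exon_overlap(candidate, t) for t in gene.transcripts), default=0)
        if score > best_score:
            best, best_score = gene, score
    return best
```

Without the skip, a protein coding transcript would go to whichever gene has the largest overlap, which is how a transcript can end up in a readthrough gene that spans its own locus.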

A workaround (depending on what you’re doing) is to just filter readthrough genes out of your analysis. You can generate a list of readthroughs via the following SQL:

mysql -uanonymous -hensembldb.ensembl.org homo_sapiens_core_90_38 -NB -e "select distinct(concat(gene.stable_id,'.',gene.version)) from gene join transcript using(gene_id) join transcript_attrib using(transcript_id) where value='readthrough'"

This can also be done through the API by looking at the transcript attributes.
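
If you save the output of the query to a file (one versioned stable ID per line), the filtering step could look something like the Python sketch below. The function names are made up for the example, and comparing IDs without their version suffix is my assumption about what most pipelines would want.

```python
def strip_version(stable_id):
    """Drop the trailing '.N' version from an Ensembl stable ID, if present."""
    return stable_id.strip().split(".", 1)[0]


def load_readthrough_ids(lines):
    """Build a set of unversioned gene IDs from the query output lines."""
    return {strip_version(line) for line in lines if line.strip()}


def filter_readthroughs(gene_ids, readthrough_ids):
    """Return the gene IDs that are not readthrough genes, keeping input order."""
    return [g for g in gene_ids if strip_version(g) not in readthrough_ids]


if __name__ == "__main__":
    # The version suffix below is a placeholder, not the real gene version.
    readthroughs = load_readthrough_ids(["ENSG00000255292.1\n"])
    genes = ["ENSG00000204370", "ENSG00000255292", "ENSG00000197580"]
    print(filter_readthroughs(genes, readthroughs))
```

In practice you would pass an open file handle for the saved query output to load_readthrough_ids instead of the hard-coded list.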

Hope this helps,

Fergal.


On 1 Dec 2017, at 14:56, Wolf Beat <Beat.Wolf at hefr.ch> wrote:

> Sorry, this is my fault. I was comparing all possible ensembl versions and copied the link from the wrong tab.
> 
> 
> So the correct links are:
> 
> 
> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000204370;r=11:112086773-112120013
> 
> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000255292;r=11:112086824-112193805
> 
> 
> Sorry for being incorrect in the last email with my links.
> 
> ________________________________
> From: Matthew Laird <lairdm at ebi.ac.uk>
> Sent: Friday, December 1, 2017 3:54:18 PM
> To: Ensembl developers list; Wolf Beat
> Subject: Re: [ensembl-dev] Probably duplicated human gene in latest release
> 
> Hello Wolf,
> 
> The latter link is for Ensembl release 75, which was the final release
> in which GRCh37 was used. So the records represented by those two links
> are for two different assemblies of the genome. Between that and the
> periodic updates that do happen to annotations between releases, it's
> not surprising the transcripts for the gene would be different. If you
> look in the gene history [1] page on the current release you can see the
> gene was updated and the version number incremented in Ensembl release 81.
> 
> If I'm misunderstanding your question, please let me know and we can try
> to resolve it. Cheers.
> 
> [1]
> http://www.ensembl.org/Homo_sapiens/Gene/Idhistory?db=core;g=ENSG00000204370;r=11:112086773-112120013
> 
> On 01/12/17 13:27, Wolf Beat wrote:
>> Hello,
>> 
>> 
>> I just noticed that the human gene SDHD appears to have two entries. They do not have the same transcripts, but at least one protein coding transcript is present in both. The description, including the HGNC symbol, is also the same, which makes me think that this is some kind of error. Both entries should probably be merged. Here are the two entries for the same gene:
>> 
>> 
>> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000204370;r=11:112086773-112120013
>> 
>> 
>> http://feb2014.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000204370;r=11:111957497-111990353
>> 
>> 
>> Kind regards
>> 
>> 
>> Beat Wolf
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
