[ensembl-dev] Probably duplicated human gene in latest release

Wolf Beat Beat.Wolf at hefr.ch
Fri Dec 1 15:56:04 GMT 2017


Thank you very much for the detailed answer. Will this (specifically the ENST00000614349.4 transcript) be fixed in a future release?


In the meantime I will follow the idea of filtering out readthrough genes (although I will have to read up a little more on what exactly that means, even if your explanation is already quite good). I will check whether I can do this through the BioMart interface.


Kind regards


Beat Wolf

________________________________
From: Dev <dev-bounces at ensembl.org> on behalf of Fergal <fergal at ebi.ac.uk>
Sent: Friday, December 1, 2017 4:52:56 PM
To: Ensembl developers list
Subject: Re: [ensembl-dev] Probably duplicated human gene in latest release

Hi Wolf,

This is a rather complicated scenario. ENSG00000255292 is a readthrough gene. Readthrough genes are manually annotated by the Havana team and are made when there is some biological evidence of transcription of a single molecule that spans two distinct loci (in this case ENSG00000204370 and ENSG00000197580). This information can be seen on the gene summary via the annotation attribute “overlapping locus”, though admittedly this is not very obvious.

While there is experimental evidence for this occurring, it is unclear if such events have any true biological meaning. Often the assigned biotype in these instances is nonsense-mediated decay, to signify that the product is not viable.

As ENSG00000204370 and ENSG00000197580 are two distinct genes and ENSG00000255292 represents a readthrough event between them, the records should not be merged. However, as you’ve noted, this scenario does then provide challenges for mapping pipelines when it comes to naming and cross-referencing.

One thing that does appear to have gone wrong is the inclusion of ENST00000614349.4 in the readthrough gene. This was added into the gene via our merge code, which decides whether to merge automatically annotated transcripts into manually curated genes based on exon overlap. ENST00000614349.4 had the most exon overlap with one of the transcripts in the readthrough gene and thus was merged in. We are going to add a rule to avoid merging protein coding transcripts into readthrough genes, which will hopefully solve the issue in future releases.
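
For illustration only, the kind of decision described above can be sketched in Python. This is not the actual Ensembl merge code, whose details are not given in this thread. The class, function names, gene IDs, and coordinates are hypothetical, and the scoring (total shared exon bases against the best-matching transcript) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive, same chromosome and strand


@dataclass
class CuratedGene:
    gene_id: str                   # hypothetical identifier
    is_readthrough: bool
    transcripts: List[List[Exon]]  # one exon list per curated transcript


def exon_overlap(a: List[Exon], b: List[Exon]) -> int:
    """Total number of bases shared between two exon lists."""
    total = 0
    for a_start, a_end in a:
        for b_start, b_end in b:
            shared = min(a_end, b_end) - max(a_start, b_start) + 1
            if shared > 0:
                total += shared
    return total


def pick_gene(candidate_exons: List[Exon],
              candidate_biotype: str,
              genes: List[CuratedGene],
              skip_readthrough: bool = True) -> Optional[str]:
    """Return the ID of the curated gene whose best transcript shares the
    most exon bases with the candidate, or None if nothing overlaps.

    With skip_readthrough=True, protein coding candidates are never merged
    into readthrough genes (the proposed extra rule).
    """
    best_id, best_overlap = None, 0
    for gene in genes:
        if (skip_readthrough and gene.is_readthrough
                and candidate_biotype == "protein_coding"):
            continue
        for exons in gene.transcripts:
            overlap = exon_overlap(candidate_exons, exons)
            if overlap > best_overlap:
                best_id, best_overlap = gene.gene_id, overlap
    return best_id
```

With the rule switched off, a candidate that shares more exon sequence with a readthrough transcript than with the single-locus gene ends up in the readthrough gene, which is the behaviour described for ENST00000614349.4. With the rule on, it falls back to the best non-readthrough gene.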

A workaround (depending on what you’re doing) is to just filter readthrough genes out of your analysis. You can generate a list of readthroughs via the following SQL:

mysql -uanonymous -hensembldb.ensembl.org homo_sapiens_core_90_38 -NB -e "select distinct(concat(gene.stable_id,'.',gene.version)) from gene join transcript using(gene_id) join transcript_attrib using(transcript_id) where value='readthrough'"

This can also be done through the API by looking at the transcript attributes.
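
Once the output of the query above is saved to a file, the filtering step itself is small. Below is a minimal Python sketch. It assumes the file holds one versioned stable ID per line (the form the concat in the query produces) and that your own gene IDs may or may not carry a version suffix. The function names and the file name are illustrative and not part of any Ensembl tool.

```python
from typing import Iterable, List, Set


def strip_version(stable_id: str) -> str:
    """Return the stable ID without a trailing '.<version>' suffix."""
    return stable_id.strip().split(".", 1)[0]


def load_readthrough_ids(lines: Iterable[str]) -> Set[str]:
    """Build a set of unversioned gene IDs from the query output lines."""
    return {strip_version(line) for line in lines if line.strip()}


def drop_readthrough(gene_ids: Iterable[str], readthrough: Set[str]) -> List[str]:
    """Keep only the gene IDs that are not in the readthrough set."""
    return [g for g in gene_ids if strip_version(g) not in readthrough]


# Example use (file name is hypothetical):
#   with open("readthrough_ids.txt") as handle:
#       readthrough = load_readthrough_ids(handle)
#   kept = drop_readthrough(my_gene_ids, readthrough)
```

Comparing on the unversioned ID is a design choice: it avoids missing a readthrough gene just because the version in your data differs from the one in the release queried.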

Hope this helps,

Fergal.


On 1 Dec 2017, at 14:56, Wolf Beat <Beat.Wolf at hefr.ch> wrote:

Sorry, this is my fault. I was comparing all possible Ensembl versions and copied the link from the wrong tab.


So the correct links are:


http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000204370;r=11:112086773-112120013

http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000255292;r=11:112086824-112193805


Sorry for the incorrect links in my last email.

________________________________
From: Matthew Laird <lairdm at ebi.ac.uk>
Sent: Friday, December 1, 2017 3:54:18 PM
To: Ensembl developers list; Wolf Beat
Subject: Re: [ensembl-dev] Probably duplicated human gene in latest release

Hello Wolf,

The latter link is for Ensembl release 75, which was the final release
in which GRCh37 was used. So the records represented by those two links
are for two different assemblies of the genome. Between that and the
periodic updates that do happen to annotations between releases, it's
not surprising the transcripts for the gene would be different. If you
look in the gene history [1] page on the current release you can see the
gene was updated and the version number incremented in Ensembl release 81.

If I'm misunderstanding your question, please let me know and we can try
to resolve it. Cheers.

[1]
http://www.ensembl.org/Homo_sapiens/Gene/Idhistory?db=core;g=ENSG00000204370;r=11:112086773-112120013

On 01/12/17 13:27, Wolf Beat wrote:
Hello,


I just noticed that the human gene SDHD seems to be present twice. The two entries do not have the same transcripts, but at least one protein coding transcript is present in both. The description, including the HGNC symbol, is also the same, which makes me think that this is some kind of error. Both entries should probably be merged. Here are the two entries for the same gene:


http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000204370;r=11:112086773-112120013


http://feb2014.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000204370;r=11:111957497-111990353


Kind regards


Beat Wolf
_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/
