[ensembl-dev] [Eva-dev] ATXN8 gene missing from Ensembl

Fri May 8 03:14:21 BST 2020

Interesting indeed.  From what I see in DQ641254 
(https://www.ebi.ac.uk/ena/data/view/DQ641254.1), the only evidence 
appears to be the 3'-RACE from the Nature paper.  There is a CDS fragment 
annotated, but it consists almost exclusively of glutamine residues 
(repeated CAG triplets preceded by ATG and followed by TAG).  I guess 
this led to it being called an ataxin, since the CAG repeat is apparently 
a defining feature.  But (according to Wikipedia) regular ataxins have 
several other domains (in Ataxin 1 [ENSG00000124788] the poly-Q is in the 
middle of the protein), and since in AL160391.1 there appears to be a stop 
codon at the end of the poly-Q sequence (which is preceded by an ATG, 
which may or may not be the start of the CDS), there may be nothing else 
here to build a functional ataxin.  (Disclaimer: I know nothing about 
ataxins; I just responded to your e-mail because I poked around out 
of curiosity and thought I might as well share what I saw.)
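
To make the ATG-(CAG)n-stop observation concrete, here is a minimal Python 
sketch that scans a nucleotide string for such open reading frames and 
reports the implied poly-Q peptide.  The input is a made-up toy sequence, 
not the actual DQ641254 record, and the names find_polyq_orfs and 
min_repeats are hypothetical.  The pattern only matches uninterrupted CAG 
runs, so a "nearly pure" repeat with interruptions would need a looser 
pattern.

```python
import re

# Hypothetical toy CDS (NOT the real DQ641254 sequence): ATG + 15 x CAG + TAG.
toy_cds = "ATG" + "CAG" * 15 + "TAG"

# ATG, then one or more uninterrupted CAG codons, then any stop codon.
POLYQ_ORF = re.compile(r"ATG((?:CAG)+)(TAA|TAG|TGA)")


def find_polyq_orfs(seq, min_repeats=5):
    """Return (start, n_repeats, peptide) for each ATG-(CAG)n-stop ORF.

    start is the 0-based offset of the ATG; peptide is M followed by
    n_repeats glutamines (Q). ORFs with fewer than min_repeats CAG codons
    are ignored.
    """
    hits = []
    for match in POLYQ_ORF.finditer(seq.upper()):
        n_repeats = len(match.group(1)) // 3
        if n_repeats >= min_repeats:
            hits.append((match.start(), n_repeats, "M" + "Q" * n_repeats))
    return hits


# Embed the toy CDS in some flanking sequence and scan it.
for start, n_repeats, peptide in find_polyq_orfs("GGC" + toy_cds + "AAT"):
    print(start, n_repeats, peptide)
```

A peptide that is just M followed by Qs, with a stop right after the 
repeat, is what makes me doubt there is anything else to the protein.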

I have no idea why AL160391 would be annotated as lncRNA; maybe that is some 
kind of default for insufficiently characterized transcripts?  The Ensembl 
developers may know more about that.

If the presence or absence of this "ATXN8" gene is important to you, I 
guess the only way to learn more is to try to find more evidence.  Short 
of doing lab experiments to characterize the transcript beyond the 3'-end, 
the only thing I can think of is looking into RNA-Seq data (microarray is 
unlikely to be very useful) from public databases (GEO, Expression Atlas). 
This is certainly not trivial, given the multitude of data sets, 
platforms, and annotations.  It might be possible to restrict the data 
sets by tissue.  I would probably not rely on existing counts tables, as 
those might not contain AL160391.  To process a large number of samples 
quickly, I would probably create a reference with just the region of 
interest to map against, but the (CAG)n repeat likely won't be helpful, as 
it might attract all kinds of similar repeat sequences, never mind that 
these repeats seem to be of variable length in ataxins, which won't help 
the mapping either.
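
As a rough illustration of the "reference with just the region of 
interest" idea, here is a minimal Python sketch that cuts a region out of a 
chromosome sequence and writes it as FASTA.  Replacing the (CAG)n/(CTG)n 
run with Ns is only one possible way to keep the repeat from attracting 
unrelated reads; that choice is my assumption, not an established recipe.  
The function names and the toy sequence are hypothetical; in practice the 
input would be the chr13 sequence and the coordinates those of the locus 
(e.g. 13:70137831-70139431 from the Ensembl link quoted below).

```python
import re

# Runs of four or more CAG or CTG units (either strand), case-insensitive.
REPEAT = re.compile(r"(?:CAG){4,}|(?:CTG){4,}", re.IGNORECASE)


def extract_region(chrom_seq, start, end):
    """Return the subsequence for 1-based, inclusive coordinates."""
    if start < 1 or end > len(chrom_seq) or start > end:
        raise ValueError("coordinates outside the sequence")
    return chrom_seq[start - 1:end]


def mask_repeats(seq):
    """Replace each CAG/CTG run with Ns of the same length."""
    return REPEAT.sub(lambda m: "N" * len(m.group(0)), seq)


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy stand-in for a chromosome: flank + 6 x CAG + flank (44 bases).
toy_chrom = "ACGT" * 5 + "CAG" * 6 + "TTGACC"
region = extract_region(toy_chrom, 11, 44)
print(to_fasta("toy_region_masked", mask_repeats(region)), end="")
```

Masking keeps the coordinates within the mini-reference intact, but reads 
that consist mostly of the repeat will then no longer map at all, so 
whether this helps depends on what you want to count.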

Good luck!

Thomas

On Fri, 8 May 2020, Kirill Tsukanov wrote:

> Hi Thomas,
>
> Thank you for the quick reply. I have looked into this case further, and it 
> only got more interesting.
>
> A paper <https://www.nature.com/articles/ng1827> in Nature which led to the 
> discovery of this gene states that there are two transcripts spanning the 
> (CTG)n repeat in 13q21.33 in the opposite directions:
>
> 1. *ATXN8OS* (a.k.a. SCA8 & KLHL1AS), a lncRNA;
> 2. *ATXN8*, a coding, nearly pure polyglutamine expansion protein.
>
> The GenBank record DQ641254 
> <https://www.ncbi.nlm.nih.gov/nuccore/DQ641254?report=GenBank> for ATXN8 has 
> the comment: "The sequence is derived from 3'-RACE analysis of the ATXN8 
> transcript. The 5'-end of ATXN8 mRNA is not yet defined." So this is what the 
> gene status seems to reflect—that it does not have a /complete/ genomic 
> mapping and annotation, not that it is invalid.
>
> In the UCSC genome browser, this partial mRNA sequence is displayed in the 
> GENCODE v32 transcript set under the accession AL160391.1:
>
> [Image: ATXN8 region in UCSC genome browser]
> (In case mailing lists won't keep the picture, here's a direct URL for a 
> copy: https://i.imgur.com/iGVwihX.png)
>
> Now, if we follow up on accession AL160391.1, we will find that it is linked 
> to Ensembl gene ENSG00000288330 
> <http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000288330;r=13:70137831-70139431;t=ENST00000673087> 
> with the same name (AL160391.1) and description "ataxin 8". This appears to 
> be the missing ATXN8 gene: it's there, it is just not linked to the HGNC ID 
> (HGNC:32925 
> <https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:32925>) 
> and name. Also, the gene type is wrong: it is registered as lncRNA, while in 
> reality it is an mRNA. The record is stated to have been manually annotated, 
> so this appears to be a human error caused by confusion between the ATXN8 and 
> ATXN8OS (which really /is/ lncRNA and is correctly annotated as such).
>
> Please let me know what you think about this.
>
> Best,
> Kirill
>
> On 06/05/2020 00:09, Thomas Danhorn wrote:
>> Hi Kirill,
>> 
>> On the NCBI site for ATXN8 you linked to it says "not in current
>> annotation release", so it looks like it may have once been considered a
>> valid gene, but not anymore.  I have also looked through a few of the
>> older Ensembl releases and none of them have ATXN8 on chromosome 13 (so
>> this is not an omission in the new release).  The ones based on the
>> GRCh37/hg19 assembly (Ensembl versions 75 and older) have "ATXN8" as a
>> synonym of ENSG00000107815, but that is on chromosome 10, so I doubt that
>> is what you are looking for.
>> 
>> Hope this helps,
>> 
>> Thomas
>
> On Tue, 5 May 2020, Kirill Tsukanov wrote:
>> Hi,
>> 
>> I have a quick question about a data issue. I noticed that Ensembl 100
>> includes the ATXN8OS gene
>> <http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000230223;r=13:70107213-70149092> 
>> (opposite strand lncRNA), but not the ATXN8 gene itself. The latter is
>> present in NCBI Gene (https://www.ncbi.nlm.nih.gov/gene/724066), but not in
>> Ensembl. This is unfortunate because it means that I can't use it in an
>> Open Targets submission as it does not have an Ensembl gene ID associated
>> with it.
>> 
>> Do you know if there's a specific reason why this gene is missing? Can we
>> expect it to be added in later Ensembl releases?
>> 
>> -- 
>> Best,
>> Kirill from the European Variation Archive
>> 
>