[ensembl-dev] Difference in genomic coordinates between REFSEQ and ENSEMBL

Mon Feb 24 10:08:50 GMT 2014

I understand the difference in the definition. I probably failed to explain
my own understanding very well.

Yes, I agree with your definition. In layman's terms, the gene start is the
most upstream start of any of its transcripts, and the gene end is the most
downstream end of any of its transcripts, even if the transcript that sets
either boundary is not the longest in the set.
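
A minimal sketch of that definition in Python. The transcript coordinates
are made-up illustrative values, not real annotation, and `gene_bounds` is
a hypothetical helper name, not part of any Ensembl API:

```python
from typing import Iterable, Tuple

# (start, end) in genomic coordinates, 1-based inclusive, start <= end
Interval = Tuple[int, int]


def gene_bounds(transcripts: Iterable[Interval]) -> Interval:
    """Return (minimum start, maximum end) over all transcripts of one gene."""
    transcripts = list(transcripts)
    if not transcripts:
        raise ValueError("a gene needs at least one transcript")
    start = min(s for s, _ in transcripts)
    end = max(e for _, e in transcripts)
    return start, end


# Hypothetical transcripts: the longest one, (1200, 5000), sets neither bound.
example = [(1200, 5000), (1000, 3000), (2500, 5200)]
print(gene_bounds(example))  # (1000, 5200)
```

Here the gene spans 1000-5200 even though no single transcript covers that
whole range.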

This is the kind of definition I would like to have, so that any RefSeq
transcript of that gene is always contained within the ENSG coordinates
for that gene. Is that correct?

In this case that does not hold.
So here is my question, reformulated:
Can I not rely on the Ensembl gene coordinates always encompassing every
RefSeq transcript for the gene of interest?
In this case it appears I cannot, and I have many other examples of this
in my dataset.
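
The check I am describing can be sketched in Python as below. The function
names are hypothetical, and because the ENSG coordinates are not quoted in
this thread, the ENST00000407010 coordinates from my first message stand in
for the Ensembl side. It also assumes both intervals are on the same
chromosome and assembly:

```python
from typing import Tuple

# (start, end) in genomic coordinates, 1-based inclusive, start <= end
Interval = Tuple[int, int]


def overhang(inner: Interval, outer: Interval) -> Tuple[int, int]:
    """Bases by which `inner` extends past `outer` on the low-coordinate
    and high-coordinate side (0 where it does not extend)."""
    low = max(0, outer[0] - inner[0])
    high = max(0, inner[1] - outer[1])
    return low, high


def is_contained(inner: Interval, outer: Interval) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return overhang(inner, outer) == (0, 0)


# Coordinates quoted in this thread (chr7).
refseq = (16127152, 16460947)   # NM_001101426
ensembl = (16130817, 16460947)  # ENST00000407010, stand-in for the gene

print(is_contained(refseq, ensembl))  # False
print(overhang(refseq, ensembl))      # (3665, 0)
```

So the RefSeq model extends 3665 bp past the Ensembl one on the
low-coordinate side, which is the "almost 4 kb" difference I mentioned.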

Best regards

Duarte

=========================
     Duarte Miguel Paulo Molha
         http://about.me/duarte
=========================

On Mon, Feb 24, 2014 at 9:55 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> Hi Duarte,
>
> Just to clarify one misconception here. Ensembl gene coordinates are the
> minimum start and maximum end of any transcript from the set linked to a
> gene (the coordinates which bound all transcripts). A gene's coordinates
> are not the same as its longest transcript model.
>
> That doesn't explain the discrepancy you've seen between NM_001101426.3
> and ENST00000407010. I can see from
> http://www.ensembl.org/Homo_sapiens/Share/17e6832cf57be0231caa268e919b3da4126347817
> that this is caused by a longer 3' UTR in the RefSeq model. I do not know
> why that's the case. Hopefully someone else on the list will have a better
> idea.
>
> Andy
>
> On 24 Feb 2014, at 09:09, Duarte Molha <duartemolha at gmail.com> wrote:
>
> > Dear Developers...
> >
> >
> > I was wondering if any of you could help me with a problem I am
> > having comparing RefSeq with Ensembl transcripts.
> >
> > I had assumed that the gene start and end coordinates in Ensembl were
> > obtained from the longest transcript model for each gene. However, this
> > does not seem to be the case when comparing a list of around 300 genes
> > I have queried.
> >
> >
> > Take a look at the example for transcript NM_001101426. In RefSeq this
> > transcript has the coordinates chr7:16127152-16460947. However, if you
> > search for it in Ensembl you get the transcript ENST00000407010 with the
> > coordinates chr7:16130817-16460947.
> >
> > If we assume that Ensembl uses the longest transcript to determine the
> > start and end coordinates, then the ISPD gene should start at 16127152
> > and not at 16130817. There is a difference of almost 4 kb. I understand
> > the gene models are different and I would expect small differences
> > between the two, but not a 4 kb difference. Can you explain the
> > discrepancy?
> > Best regards
> > Duarte
> >
> > =========================
> >      Duarte Miguel Paulo Molha
> >          http://about.me/duarte
> > =========================
> > _______________________________________________
> > Dev mailing list    Dev at ensembl.org
> > Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> > Ensembl Blog: http://www.ensembl.info/