[ensembl-dev] exon coordinate discrepancy between NCBI and Ensembl

Mon May 23 22:00:06 BST 2011

Hi Reece,

Thanks a lot for your kind words about Ensembl.  We definitely appreciate it.  

I hope this message provides some additional useful information.  Some others in the project may chime in with other details or to correct something below.

A lot of these differences come down to what exactly these objects are.  People often treat objects as equivalent when they are actually different things.  Usually this does not matter for analysis but, as you have discovered, at times it does.

In this context RefSeq mRNAs are objects that exist independently of the genome.  They only acquire their coordinates when they are mapped to the genome.  They may be withdrawn or changed.  If one does not use the dot version (NM_023035.2 rather than just NM_023035), such a change could be missed.
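
As a small illustration of the versioning point, here is a minimal Python sketch that splits an identifier into accession and dot version and refuses to treat an unversioned identifier as the same record.  The function names and the accession pattern (two capital letters, an underscore, digits) are assumptions made for this example and are not part of any RefSeq or Ensembl API.

```python
import re

# Assumed pattern for this sketch: two capital letters, underscore, digits,
# with an optional ".N" version suffix (e.g. NM_023035.2).
_ACCESSION_RE = re.compile(r"^(?P<accession>[A-Z]{2}_\d+)(?:\.(?P<version>\d+))?$")


def split_accession(identifier):
    """Return (accession, version); version is None when no dot version is given."""
    match = _ACCESSION_RE.match(identifier)
    if match is None:
        raise ValueError("not a recognised accession: %r" % identifier)
    version = match.group("version")
    return match.group("accession"), (int(version) if version is not None else None)


def same_record(a, b):
    """True only when both identifiers carry a version and agree exactly."""
    acc_a, ver_a = split_accession(a)
    acc_b, ver_b = split_accession(b)
    return acc_a == acc_b and ver_a is not None and ver_a == ver_b
```

With this, same_record("NM_023035.2", "NM_023035") is False, because the unversioned identifier could refer to any version of the record.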

Ensembl genes are the result of a process that integrates many different evidence types, using the genome sequence as the substrate for aligning proteins, cDNAs, mRNAs, etc. to create transcripts.  These transcripts are merged into genes and then merged with the manually annotated transcripts from Havana.

These processes necessarily produce some differences: in part because there are errors in the various data sets, which are corrected over time; in part because the data sets are incomplete; and in part because the process of creating each resource is fundamentally different.

Once the Ensembl gene set is created we create "external references" to the RefSeq identifiers to identify those objects that are "biologically the same".  Note, however, that a RefSeq that corresponds to an Ensembl gene does not mean that these have identical placements on the genome assembly.
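
To make that distinction concrete, here is a rough Python sketch of how one might check whether two linked transcript models have identical placements, differ only outside the CDS, or differ within it (the categories used in the quoted message below).  It assumes exons are given as (start, end) pairs in 1-based inclusive genomic coordinates and the CDS as a (start, end) genomic span; the function names and the coordinates in the note afterwards are made up for illustration and do not come from Ensembl or NCBI.

```python
def clip_to_cds(exons, cds_start, cds_end):
    """Return the coding part of each exon, dropping exons that are entirely UTR."""
    clipped = []
    for start, end in sorted(exons):
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo <= hi:
            clipped.append((lo, hi))
    return clipped


def classify(exons_a, cds_a, exons_b, cds_b):
    """Classify two transcript models as 'identical', 'utr_only' or 'cds_differs'."""
    if sorted(exons_a) == sorted(exons_b) and cds_a == cds_b:
        return "identical"
    if clip_to_cds(exons_a, *cds_a) == clip_to_cds(exons_b, *cds_b):
        return "utr_only"
    return "cds_differs"
```

For example, with hypothetical coordinates, a model with exons [(100, 200), (300, 400)] and one with exons [(90, 200), (300, 410)], both with CDS (150, 350), would be classified as "utr_only".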

We all recognise that identifying those cases where there are no differences between the sets has value for the community and this is the basis of the CCDS project.

It may seem amazing that there is not a complete, definitive set of human genes, but there is not.  There are very good reasons for these differences and there is not one set that is "right".  This is one reason why Ensembl, RefSeq and other resources continue to exist.

Note also that creating a catalog of the ways in which Ensembl and RefSeq differ will only provide a snapshot on a given date.  Both Ensembl and RefSeq update often.

Paul

On 23 May 2011, at 18:48, Reece Hart wrote:

> On Mon, 2011-05-23 at 09:43 -0700, Kiran Mukhyala wrote:
>> As far as I understand the Ensembl model, a refseq_dna (an external
>> reference) linked to an ensembl transcript, does not mean their exons
>> are identical. NCBI's mapping of NM_023035.2 to the genome could be
>> different from Ensembl's mapping of ENST00000360228 because the
>> mapping methods are different and the sequences could also be
>> different.
>> 
>> Is there any other reason you expect the transcript models to be
>> identical?
> 
> On Mon, 2011-05-23 at 17:47 +0100, Susan Fairley wrote:
>> You note that both Ensembl and NCBI map rs58729888 to the same genomic
>> position. As the two transcript structures you are looking at differ, 
>> then the positions of rs58729888 in the two transcripts also differ
>> when viewed at the transcript level, although it is the same genomic
>> location.
> 
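
Susan's point can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: plus-strand transcripts, exons as (start, end) pairs in 1-based inclusive genomic coordinates, and a hypothetical function name and made-up coordinates.

```python
def genomic_to_transcript(pos, exons):
    """Map a genomic position to a 1-based transcript position.

    Returns None when the position falls outside every exon.
    Assumes a plus-strand transcript.
    """
    offset = 0  # transcript bases contributed by the exons already passed
    for start, end in sorted(exons):
        if start <= pos <= end:
            return offset + (pos - start) + 1
        offset += end - start + 1
    return None


# Two hypothetical models that differ only in where the first exon starts.
model_a = [(100, 200), (300, 400)]
model_b = [(120, 200), (300, 400)]

print(genomic_to_transcript(350, model_a))  # 152
print(genomic_to_transcript(350, model_b))  # 132
```

The same genomic position lands at a different transcript position in each model because the first exon has a different length.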
> 
> I understand the desire to group transcripts based on various criteria,
> and such grouping is extremely valuable. There are many criteria on
> which we might call transcripts identical, similar, or different, such
> as exon structure, processed transcript, or translation product. 
> 
> There are two reasons that I expected the exon structure to be
> *identical* at least over the CDS. First and most important is that exon
> structure is at the heart of mapping genome, transcript, and protein
> variants consistently within the community. If exon structure isn't
> preserved exactly for a given reference sequence, we won't be able to
> exchange variants reliably. The second reason is more subjective: as a
> design tenet, I believe that objects from a primary database should be
> represented as-is in any recapitulation. If I look up NM_023035.2, I
> want exactly that; anything else is something else.
> 
> Over the weekend, I undertook a much lengthier comparison of the exon
> structures in genomic coordinates for ~25K transcripts as fetched from
> Ensembl 61 and NCBI by NM accession. I'm still comparing, but here's
> what I know so far:
> 
>      * Of the 25K, 3190 (~12%) have exon structures that are identical
>        between the two sources.
>      * Another ~70% appear to differ in the 3' and 5' UTR exons only
>        (i.e., no CDS change)
>      * I've looked at ~50 of the remaining and found a difference in
>        only one transcript (NM_000256.3), in which a pair of 3+18 nt
>        exons is substituted for a 21 nt exon. (This is in addition to
>        NM_023035.2, in the original post.)
>      * An Amazon-based instance with Ensembl on EBS storage is >100x
>        faster than remote lookups to NCBI (comparing single threaded
>        runs in both cases).
> 
> For the purposes of general and reliable mapping of variants between
> genome, transcript, and protein coordinates, exact exon boundaries are
> essential. Similar CDS is necessary but not sufficient.
> 
> At this point, I would really like to understand the ways (and
> frequencies) in which Ensembl and NCBI transcripts differ. If somebody
> has this data, I'd love to see it. If not, I need to generate it.
> 
> For the record, this experience has only bolstered my affection for
> Ensembl. This is difficult stuff and they do a great job -- responsive
> help desk and community, a functional API, and sane schema are all very
> much appreciated. It took me ~5 minutes to write the code to fetch exon
> data from Ensembl. Thank you!
> 
> -Reece